Posted to commits@mahout.apache.org by pa...@apache.org on 2016/09/13 20:02:30 UTC
mahout git commit: MAHOUT-1853: Add new thresholds and partitioning methods to SimilarityAnalysis
Repository: mahout
Updated Branches:
refs/heads/master 3351b75b3 -> b5fe4aab2
MAHOUT-1853: Add new thresholds and partitioning methods to SimilarityAnalysis
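The summary line is terse, so here is an illustrative sketch of what the new partitioning options look like from the caller's side. Everything below is inferred from the diff: the `ParOpts` field names (`minPar`, `exactPar`, `autoPar`) are read off the `drm.par(...)` call sites, while the field defaults and the resolution order are assumptions for illustration, not part of this commit.

```scala
// Hypothetical stand-in, inferred from the diff: ParOpts carries the three
// knobs that cooccurrences(...) now forwards to drm.par(min, exact, auto).
case class ParOpts(minPar: Int = -1, exactPar: Int = -1, autoPar: Boolean = true)

// Assumed resolution order, for illustration only: an exact partition count
// wins, then a minimum, otherwise the backend's 'auto' heuristic keeps its
// current choice.
def resolvePartitions(current: Int, opts: ParOpts): Int =
  if (opts.exactPar > 0) opts.exactPar
  else if (opts.minPar > 0) math.max(current, opts.minPar)
  else current // autoPar: backend remains free to optimize

// A caller that knows its cluster size can now pin partitioning instead of
// relying on the previous hard-coded drmARaw.par(auto = true):
val pinned = ParOpts(exactPar = 16, autoPar = false)
println(resolvePartitions(8, ParOpts())) // auto keeps the current count: 8
println(resolvePartitions(8, pinned))    // exact request wins: 16
```

In the real API the options are passed straight through, e.g. `SimilarityAnalysis.cooccurrences(drmA, drmBs = Array(drmB), parOpts = ParOpts(exactPar = 16, autoPar = false))`; the diff shows only that `ParOpts()` has usable defaults (`lazy val defaultParOpts = ParOpts()`), not what those defaults are.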
Project: http://git-wip-us.apache.org/repos/asf/mahout/repo
Commit: http://git-wip-us.apache.org/repos/asf/mahout/commit/b5fe4aab
Tree: http://git-wip-us.apache.org/repos/asf/mahout/tree/b5fe4aab
Diff: http://git-wip-us.apache.org/repos/asf/mahout/diff/b5fe4aab
Branch: refs/heads/master
Commit: b5fe4aab22e7867ae057a6cdb1610cfa17555311
Parents: 3351b75
Author: pferrel <pa...@occamsmachete.com>
Authored: Tue Sep 13 13:02:14 2016 -0700
Committer: pferrel <pa...@occamsmachete.com>
Committed: Tue Sep 13 13:02:14 2016 -0700
----------------------------------------------------------------------
CHANGELOG | 627 -------------------
.../mahout/math/cf/SimilarityAnalysis.scala | 192 +++++-
.../mahout/cf/SimilarityAnalysisSuite.scala | 125 +++-
3 files changed, 272 insertions(+), 672 deletions(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/mahout/blob/b5fe4aab/CHANGELOG
----------------------------------------------------------------------
diff --git a/CHANGELOG b/CHANGELOG
deleted file mode 100644
index 5cd8af5..0000000
--- a/CHANGELOG
+++ /dev/null
@@ -1,627 +0,0 @@
-Mahout Change Log
-
-Release 0.12.0 - unreleased
-
- MAHOUT-1775: FileNotFoundException caused by aborting the process of downloading Wikipedia dataset (Bowei Zhang via smarthi)
-
- MAHOUT-1771: Cluster dumper omits indices and 0 elements for dense vector or sparse containing 0s (srowen)
-
- MAHOUT-1613: classifier.df.tools.Describe does not handle -D parameters (haohui mai via smarthi)
-
- MAHOUT-1642: Iterator class within SimilarItems class always misses the first element (Oleg Zotov via smarthi)
-
- MAHOUT-1675: Remove MLP from codebase (ZJaffe via smarthi)
-
-Release 0.11.0 - 2015-08-07
-
- MAHOUT-1744: Deprecate lucene2seq (apalumbo)
-
- MAHOUT-1761: Upgraded to Apache parent pom v17 (sslavic)
-
- MAHOUT-1745: Purge deprecated ConcatVectorsJob from codebase (apalumbo)
-
- MAHOUT-1757: small fix in spca formula (smarthi)
-
- MAHOUT-1756: Missing +=: and *=: operators on vectors (smarthi)
-
- NOJIRA: Clean up CLI help for spark-rowsimilarity and fixed test that intermittently failed (pferrel)
-
- MAHOUT-1685: Move Mahout shell to Spark 1.3+ (dlyubimov, apalumbo)
-
- MAHOUT-1653: Spark 1.3 (pferrel, apalumbo)
-
- MAHOUT-1754: Distance and squared distance matrices routines (dlyubimov)
-
- MAHOUT-1753: First and second moment routines (dlyubimov)
-
- MAHOUT-1746: mxA ^ 2, mxA ^ 0.5 to mean the same thing as mxA * mxA and mxA ::= sqrt _ (dlyubimov)
-
- MAHOUT-1736: Implement allreduceBlock() on H2O (avati)
-
- MAHOUT-1752: Implement CbindScalar operator on H2O (avati)
-
- MAHOUT-1660: Hadoop1HDFSUtil.readDRMHEader should be taking Hadoop conf (dlyubimov)
-
- MAHOUT-1713: Performance and parallelization improvements for AB', A'B, A'A spark physical operators (dlyubimov)
-
- MAHOUT-1714: Add MAHOUT_OPTS environment when running Spark shell (dlyubimov)
-
- MAHOUT-1715: Closeable API for broadcast tensors (dlyubimov)
-
- MAHOUT-1716: Scala logging style (dlyubimov)
-
- MAHOUT-1717: allreduceBlock() operator api and Spark implementation (dlyubimov)
-
- MAHOUT-1718: Support for conversion of any type-keyed DRM into ordinally-keyed DRM (dlyubimov)
-
- MAHOUT-1719: Unary elementwise function operator and function fusions (dlyubimov)
-
- MAHOUT-1720: Support 1 cbind X, X cbind 1 etc. for both Matrix and DRM (dlyubimov)
-
- MAHOUT-1721: rowSumsMap() summary for non-int-keyed DRMs (dlyubimov)
-
- MAHOUT-1722: DRM row sampling api (dlyubimov)
-
- MAHOUT-1723: Optional structural "flavor" abstraction for in-core matrices (dlyubimov)
-
- MAHOUT-1724: Optimizations of matrix-matrix in-core multiplication based on structural flavors (dlyubimov)
-
- MAHOUT-1725: elementwise power operator ^ (dlyubimov)
-
- MAHOUT-1726: R-like vector concatenation operator (dlyubimov)
-
- MAHOUT-1727: Elementwise analogues of scala.math functions for tensor types (dlyubimov)
-
- MAHOUT-1728: In-core functional assignments (dlyubimov)
-
- MAHOUT-1729: Straighten out behavior of Matrix.iterator() and iterateNonEmpty() (dlyubimov)
-
- MAHOUT-1730: New mutable transposition view for in-core matrices (dlyubimov)
-
- MAHOUT-1731: Deprecate SparseColumnMatrix (dlyubimov)
-
- MAHOUT-1732: Native support for kryo serialization of tensor types (dlyubimov)
-
-Release 0.10.1 - 2015-05-31
-
- MAHOUT-1704: Pare down dependency jar for h2o (apalumbo)
-
- MAHOUT-1697: Fixed paths to which math-scala and spark modules docs get packaged under in bin distribution archive (sslavic)
-
- MAHOUT-1696: QRDecomposition.solve(...) can return incorrect Matrix types (apalumbo)
-
- MAHOUT-1690: CLONE - Some vector dumper flags are expecting arguments. (smarthi)
-
- MAHOUT-1693: FunctionalMatrixView materializes row vectors in scala shell (apalumbo)
-
- MAHOUT-1680: Renamed mahout-distribution to apache-mahout-distribution (sslavic)
-
-Release 0.10.0 - 2015-04-11
-
- MAHOUT-1630: Incorrect SparseColumnMatrix.numSlices() causes IndexException in toString() (Oleg Nitz, smarthi)
-
- MAHOUT-1665: Update hadoop commands in example scripts (akm)
-
- MAHOUT-1676: Deprecate MLP, ConcatenateVectorsJob and ConcatenateVectorsReducer in the codebase (apalumbo)
-
- MAHOUT-1622: MultithreadedBatchItemSimilarities outputs incorrect number of similarities (Jesse Daniels, Anand Avati via smarthi)
-
- MAHOUT-1605: Make VisualizerTest locale independent (Frank Rosner, Anand Avati via smarthi)
-
- MAHOUT-1635: Getting an exception when I provide classification labels manually for Naive Bayes (apalumbo)
-
- MAHOUT-1662: Potential Path bug in SequenceFileVaultIterator breaks DisplaySpectralKMeans (Shannon Quinn)
-
- MAHOUT-1656: Change SNAPSHOT version from 1.0 to 0.10.0 (smarthi)
-
- MAHOUT-1593: cluster-reuters.sh does not work complaining java.lang.IllegalStateException (smarthi via akm)
-
- MAHOUT-1661: All Lanczos modules marked as @Deprecated and slated for removal in future releases (Shannon Quinn)
-
- MAHOUT-1638: H2O bindings fail at drmParallelizeWithRowLabels(...) (Anand Avati via apalumbo)
-
- MAHOUT-1667: Hadoop 1 and 2 profile in POM (sslavic)
-
- MAHOUT-1564: Naive Bayes Classifier for New Text Documents (apalumbo)
-
- MAHOUT-1524: Script to auto-generate and view the Mahout website on a local machine (Saleem Ansari via apalumbo)
-
- MAHOUT-1589: Deprecate mahout.cmd due to lack of support
-
- MAHOUT-1655: Refactors mr-legacy into mahout-hdfs and mahout-mr, Spark now depends on much reduced mahout-hdfs
-
- MAHOUT-1522: Handle logging levels via log4j.xml (akm)
-
- MAHOUT-1602: Euclidean Distance Similarity Math (Leonardo Fernandez Sanchez, smarthi)
-
- MAHOUT-1619: HighDFWordsPruner overwrites cache files (Burke Webster, smarthi)
-
- MAHOUT-1516: classify-20newsgroups.sh failed: /tmp/mahout-work-jpan/20news-all does not exists in hdfs. (Jian Pan via apalumbo)
-
- MAHOUT-1559: Add documentation for and clean up the wikipedia classifier example (apalumbo)
-
- MAHOUT-1598: extend seq2sparse to handle multiple text blocks of same document (Wolfgang Buchnere via akm)
-
- MAHOUT-1659: Remove deprecated Lanczos solver from spectral clustering in mr-legacy (Shannon Quinn)
-
- MAHOUT-1612: NullPointerException happens during JSON output format for clusterdumper (smarthi, Manoj Awasthi)
-
- MAHOUT-1652: Java 7 update (smarthi)
-
- MAHOUT-1639: Streaming kmeans doesn't properly validate estimatedNumMapClusters -km (smarthi)
-
- MAHOUT-1493: Port Naive Bayes to Scala DSL (apalumbo)
-
- MAHOUT-1611: Preconditions.checkArgument in org.apache.mahout.utils.ConcatenateVectorsJob (Haishou Ma via smarthi)
-
- MAHOUT-1615: SparkEngine drmFromHDFS returning the same Key for all Key,Vec Pairs for Text-Keyed SequenceFiles (Anand Avati, dlyubimov, apalumbo)
-
- MAHOUT-1610: Update tests to pass in Java 8 (srowen)
-
- MAHOUT-1608: Add option in WikipediaToSequenceFile to remove category labels from documents (apalumbo)
-
- MAHOUT-1604: Spark version of rowsimilarity driver and associated additions to SimilarityAnalysis.scala (pferrel)
-
- MAHOUT-1500: H2O Integration (Anand Avati via apalumbo)
-
- MAHOUT-1606 - Add rowSums, rowMeans and diagonal extraction operations to distributed matrices (dlyubimov)
-
- MAHOUT-1603: Tweaks for Spark 1.0.x (dlyubimov & pferrel)
-
- MAHOUT-1596: implement rbind() operator (Anand Avati and dlyubimov)
-
- MAHOUT-1597: A + 1.0 (element-wise scala operation) gives wrong result if rdd is missing rows, Spark side (dlyubimov)
-
- MAHOUT-1595: MatrixVectorView - implement a proper iterateNonZero() (Anand Avati via dlyubimov)
-
- MAHOUT-1590 Mahout unit test failures due to guava version conflict on hadoop 2 (Venkat Ranganathan via sslavic)
-
- MAHOUT-1529(e): Move dense/sparse matrix test in mapBlock into spark (Anand Avati via dlyubimov)
-
- MAHOUT-1583: cbind() operator for Scala DRMs (dlyubimov)
-
- MAHOUT-1563: Eliminated warnings about multiple scala versions (sslavic)
-
- MAHOUT-1541, MAHOUT-1568, MAHOUT-1569: Created text-delimited file I/O traits and classes on spark, a MahoutDriver for a CLI and an ItemSimilarityDriver using the CLI
-
- MAHOUT-1573: More explicit parallelism adjustments in math-scala DRM apis; elements of automatic parallelism management (dlyubimov)
-
- MAHOUT-1580: Optimize getNumNonZeroElements() (ssc)
-
- MAHOUT-1464: Cooccurrence Analysis on Spark (pat)
-
- MAHOUT-1578: Optimizations in matrix serialization (ssc)
-
- MAHOUT-1572: blockify() to detect (naively) the data sparsity in the loaded data (dlyubimov)
-
- MAHOUT-1571: Functional Views are not serialized as dense/sparse correctly (dlyubimov)
-
- MAHOUT-1566: (Experimental) Regular ALS factorizer with conversion tests, optimizer enhancements and bug fixes (dlyubimov)
-
- MAHOUT-1537: Minor fixes to spark-shell (Anand Avati via dlyubimov)
-
- MAHOUT-1529: Finalize abstraction of distributed logical plans from backend operations (dlyubimov)
-
- MAHOUT-1489: Interactive Scala & Spark Bindings Shell & Script processor (dlyubimov)
-
- MAHOUT-1346: Spark Bindings (DRM) (dlyubimov)
-
- MAHOUT-1555: Exception thrown when a test example has the label not present in training examples (Karol Grzegorczyk via smarthi)
-
- MAHOUT-1446: Create an intro for matrix factorization (Jian Wang via ssc)
-
- MAHOUT-1480: Clean up website on 20 newsgroups (Andrew Palumbo via ssc)
-
- MAHOUT-1561: cluster-syntheticcontrol.sh not running locally with MAHOUT_LOCAL=true (Andrew Palumbo via ssc)
-
- MAHOUT-1558: Clean up classify-wiki.sh and add in a binary classification problem (Andrew Palumbo via ssc)
-
- MAHOUT-1560: Last batch is not filled correctly in MultithreadedBatchItemSimilarities (Jarosław Bojar)
-
- MAHOUT-1554: Provide more comprehensive classification statistics (Karol Grzegorczyk via ssc)
-
- MAHOUT-1548: Fix broken links in quickstart webpage (Andrew Palumbo via ssc)
-
- MAHOUT-1542: Tutorial for playing with Mahout's Spark shell (ssc)
-
- MAHOUT-1533: Remove Frequent Pattern Mining (ssc)
-
- MAHOUT-1532: Add solve() function to the Scala DSL (ssc)
-
- MAHOUT-1530: Custom prompt and welcome message for the Spark Shell (ssc)
-
- MAHOUT-1527: Fix wikipedia classifier example (Andrew Palumbo via ssc)
-
- MAHOUT-1526: Ant file in examples (ssc)
-
- MAHOUT-1523: Remove @author tags in sparkbindings (ssc)
-
- MAHOUT-1521: lucene2seq - Error trying to load data from stored field (when non-indexed) (Terry Blankers via frankscholten)
-
- MAHOUT-1520: Fix links in Mahout website documentation (Saleem Ansari via smarthi)
-
- MAHOUT-1519: Remove StandardThetaTrainer (Andrew Palumbo via ssc)
-
- MAHOUT-1517: Remove casts to int in ALSWRFactorizer (ssc)
-
- MAHOUT-1513: Deprecate Canopy Clustering (ssc)
-
- MAHOUT-1511: Renaming core to mrlegacy (frankscholten)
-
- MAHOUT-1510: Goodbye MapReduce (ssc)
-
- MAHOUT-1509: Invalid URL in link from "quick start/basics" page (Nick Martin, smarthi)
-
- MAHOUT-1508: Performance problems with sparse matrices (ssc)
-
- MAHOUT-1505: structure of clusterdump's JSON output (akm)
-
- MAHOUT-1504: Enable/fix thetaSummer job in TrainNaiveBayesJob (Andrew Palumbo, smarthi)
-
- MAHOUT-1503: TestNaiveBayesDriver fails in sequential mode (Andrew Palumbo, smarthi)
-
- MAHOUT-1502: Update Naive Bayes Webpage to Current Implementation (Andrew Palumbo via ssc)
-
- MAHOUT-1501: ClusterOutputPostProcessorDriver has private default constructor (ssc)
-
- MAHOUT-1498: DistributedCache.setCacheFiles in DictionaryVectorizer overwrites jars pushed using oozie (Sergey via ssc)
-
- MAHOUT-1497: mahout resplit not producing split files (ssc)
-
- MAHOUT-1496: Create a website describing the distributed ALS recommender (Jian Wang via ssc)
-
- MAHOUT-1491: Spectral KMeans Clustering doesn't clean its /tmp dir and fails when seeing it again (smarthi)
-
- MAHOUT-1488: DisplaySpectralKMeans fails: examples/output/clusteredPoints/part-m-00000 does not exist (Saleem Ansari via smarthi)
-
- MAHOUT-1483: Organize links in web site navigation bar (akm)
-
- MAHOUT-1482: Rework quickstart website (Jian Wang via ssc)
-
- MAHOUT-1476: Cleanup website on Hidden Markov Models (akm)
-
- MAHOUT-1475: Cleanup website on Naive Bayes (smarthi)
-
- MAHOUT-1472: Cleanup website on fuzzy kmeans (smarthi)
-
- MAHOUT-1471: Cleanup website for Canopy clustering (smarthi)
-
- MAHOUT-1468: Creating a new page for StreamingKMeans documentation on mahout website (Maxim Arap and Pavan Kumar via akm)
-
- MAHOUT-1467: ClusterClassifier readPolicy leaks file handles (Avi Shinnar, smarthi)
-
- MAHOUT-1466: Cluster visualization fails to execute (ssc)
-
- MAHOUT-1465: Clean up README (akm)
-
- MAHOUT-1463: Modify OnlineSummarizers to use the TDigest dependency from Maven Central (tdunning, smarthi)
-
- MAHOUT-1460: Remove reference to Dirichlet in ClusterIterator (frankscholten)
-
- MAHOUT-1459: Move Hadoop related code out of CanopyClusterer (frankscholten)
-
- MAHOUT-1458: Remove KMeansConfigKeys and FuzzyKMeansConfigKeys (frankscholten)
-
- MAHOUT-1457: Move EigenSeedGenerator into spectral kmeans package (frankscholten)
-
- MAHOUT-1455: Forkcount config causes JVM crashes during build (frankscholten)
-
- MAHOUT-1451: Cleaning up the examples for clustering on the website (Gaurav Misra via ssc)
-
- MAHOUT-1450: Cleaning up clustering documentation on mahout website (Pavan Kumar)
-
- MAHOUT-1449: Update the Known Issues in Random Forests Page (Manoj Awasthi via ssc)
-
- MAHOUT-1448: In Random Forest, the training does not support multiple input files. The input dataset must be one single file. (Manoj Awasthi via ssc)
-
- MAHOUT-1447: ImplicitFeedbackAlternatingLeastSquaresSolver tests and features (Adam Ilardi via ssc)
-
- MAHOUT-1445: Create an intro for item based recommender (Nick Martin via ssc)
-
- MAHOUT-1440: Add option to set the RNG seed for inital cluster generation in Kmeans/fKmeans (Andrew Palumbo via ssc)
-
- MAHOUT-1438: "quickstart" tutorial for building a simple recommender (Maciej Mazur and Steve Cook via ssc)
-
- MAHOUT-1434: Dead links on the web ste (Kevin Moulart, smarthi)
-
- MAHOUT-1433: Make SVDRecommender look at all unknown items of a user per default (ssc)
-
- MAHOUT-1429: Parallelize YtransposeY in ImplicitFeedbackAlternatingLeastSquaresSolver (Adam Ilardi via ssc)
-
- MAHOUT-1428: Recommending already consumed items (Dodi Hakim via ssc)
-
- MAHOUT-1425: SGD classifier example with bank marketing dataset. (frankscholten)
-
- MAHOUT-1420: Add solr-recommender to examples (Pat Ferrel via akm)
-
- MAHOUT-1419: Random decision forest is excessively slow on numeric features (srowen)
-
- MAHOUT-1417: Random decision forest implementation fails in Hadoop 2 (srowen)
-
- MAHOUT-1416: Make access of DecisionForest.read(dataInput) less restricted (Manoj Awasthi via smarthi)
-
- MAHOUT-1415: Clone method on sparse matrices fails if there is an empty row which has not been set explicitly (till.rohrmann via ssc)
-
- MAHOUT-1413: Rework Algorithms page (ssc)
-
- MAHOUT-1388: Add command line support and logging for MLP (Yexi Jiang via ssc)
-
- MAHOUT-1385: Caching Encoders don't cache (Johannes Schulte, Manoj Awasthi via ssc)
-
- MAHOUT-1356: Ensure unit tests fail fast when writing outside mvn target directory (isabel, smarthi, dweiss, frankscholten, akm)
-
- MAHOUT-1329: Mahout for hadoop 2 (gcapan, Sergey Svinarchuk)
-
- MAHOUT-1310: Mahout support windows (Sergey Svinarchuk via ssc)
-
- MAHOUT-1278: Upgraded to apache parent pom version 16 (sslavic)
-
-Release 0.9 - 2014-02-01
-
- MAHOUT-1387: Create page for release notes (ssc)
-
- MAHOUT-1411: Random test failures from TDigestTest (smarthi)
-
- MAHOUT-1410: clusteredPoints do not contain a vector id (smarthi, Andrew Musselman)
-
- MAHOUT-1409: MatrixVectorView has index check error (tdunning)
-
- MAHOUT-1402: Zero clusters using streaming k-means option in cluster-reuters.sh (smarthi)
-
- MAHOUT-1401: Resurrect Frequent Pattern mining (smarthi)
-
- MAHOUT-1400: Remove references to deprecated and removed algorithms from examples scripts (ssc)
-
- MAHOUT-1399: Fixed multiple slf4j bindings when running Mahout examples issue (sslavic)
-
- MAHOUT-1398: FileDataModel should provide a constructor with a delimiterPattern (Roy Guo via ssc)
-
- MAHOUT-1396: Accidental use of commons-math won't work with next Hadoop 2 release (srowen)
-
- MAHOUT-1394: Undeprecate Lanczos (ssc)
-
- MAHOUT-1393: Remove duplicated code from getTopTerms and getTopFeatures in AbstractClusterWriter (Diego Carrion via smarthi)
-
- MAHOUT-1392: Streaming KMeans should write centroid output to a 'part-r-xxxx' file when executed in sequential mode (smarthi)
-
- MAHOUT-1390: SVD hangs for certain inputs (tdunning)
-
- MAHOUT-1389: Complementary Naive Bayes Classifier not getting called when "-c" option is activated (Gouri Shankar Majumdar via smarthi)
-
- MAHOUT-1384: Executing the MR version of Naive Bayes/CNB of classify_20newgroups.sh fails in seqdirectory step (smarthi)
-
- MAHOUT-1382: Upgrade Mahout third party jars for 0.9 Release (smarthi)
-
- MAHOUT-1380: Streaming KMeans fails when executed in Sequential Mode (smarthi)
-
- MAHOUT-1379: ClusterQualitySummarizer fails with the new T-Digest for clusters with 1 data point (smarthi)
-
- MAHOUT-1378: Running Random Forest with Ignored features fails when loading feature descriptor from JSON file (Sam Wu via smarthi)
-
- MAHOUT-1377: Exclude JUnit.jar from tarball (Sergey Svinarchuk via smarthi)
-
- MAHOUT-1374: Ability to provide input file with userid, itemid pair (Aliaksei Litouka via ssc)
-
- MAHOUT-1371: Arff loader can misinterpret nominals with integer, real or string (Mansur Iqbal via smarthi)
-
- MAHOUT-1370: Vectordump doesn't write to output file in MapReduce Mode (smarthi)
-
- MAHOUT-1368: Convert OnlineSummarizer to use the new TDigest (tdunning)
-
- MAHOUT-1367: WikipediaXmlSplitter --> Exception in thread "main" java.lang.NullPointerException (smarthi)
-
- MAHOUT-1364: Upgrade Mahout codebase to Lucene 4.6 (Frank Scholten)
-
- MAHOUT-1363: Rebase packages in mahout-scala (dlyubimov)
-
- MAHOUT-1362: Remove examples/bin/build-reuters.sh (smarthi)
-
- MAHOUT-1361: Online algorithm for computing accurate Quantiles using 1-D clustering (tdunning)
-
- MAHOUT-1358: StreamingKMeansThread throws IllegalArgumentException when REDUCE_STREAMING_KMEANS is set to true (smarthi)
-
- MAHOUT-1355: InteractionValueEncoder produces wrong traceDictionary entries (Johannes Schulte via smarthi)
-
- MAHOUT-1353: Visibility of preparePreferenceMatrix directory location (Pat Ferrel, ssc)
-
- MAHOUT-1352: Option to change RecommenderJob output format (Pat Ferrel, ssc)
-
- MAHOUT-1351: Adding DenseVector support to AbstractCluster (David DeBarr via smarthi)
-
- MAHOUT-1349: Clusterdumper/loadTermDictionary crashes when highest index in (sparse) dictionary vector is larger than dictionary vector size (Andrew Musselman via smarthi)
-
- MAHOUT-1347: Add Streaming K-Means clustering algorithm to examples/bin/cluster-reuters.sh (smarthi)
-
- MAHOUT-1345: Enable randomised testing for all Mahout modules (Dawid Weiss, Isabel, sslavic, Frank Scholten, smarthi)
-
- MAHOUT-1343: JSON output format support in cluster dumper (Telvis Calhoun via sslavic)
-
- MAHOUT-1333: Fixed examples bin directory permissions in distribution archives (Mike Percy via sslavic)
-
- MAHOUT-1319: seqdirectory -filter argument silently ignored when run as MR (smarthi)
-
- MAHOUT-1317: Clarify some of the messages in Preconditions.checkArgument (Nikolai Grinko, smarthi)
-
- MAHOUT-1314: StreamingKMeansReducer throws NullPointerException when REDUCE_STREAMING_KMEANS is set to true (smarthi)
-
- MAHOUT-1313: Fixed unwanted integral division bug in RowSimilarityJob downsampling code where precision should have been retained (sslavic)
-
- MAHOUT-1312: LocalitySensitiveHashSearch does not limit search results (sslavic)
-
- MAHOUT-1308: Cannot extend CandidateItemsStrategy due to restricted visibility (David Geiger, smarthi)
-
- MAHOUT-1301: toString() method of SequentialAccessSparseVector has excess comma at the end (Alexander Senov, smarthi)
-
- MAHOUT-1297: New module for linear algebra scala DSL (dlyubimov)
-
- MAHOUT-1296: Remove deprecated algorithms (ssc)
-
- MAHOUT-1295: Excluded all Maven's target directories from distribution archives (sslavic)
-
- MAHOUT-1294: Cleanup previously installed artifacts from CI server local repository (sslavic)
-
- MAHOUT-1293: Source distribution tar.gz archive cannot be unpacked on Linux (sslavic)
-
- MAHOUT-1292: lucene2seq should validate the 'id' field (Frank Scholten via smarthi)
-
- MAHOUT-1291: MahoutDriver yields cosmetically suboptimal exception when bin/mahout runs without args, on some Hadoop versions (srowen)
-
- MAHOUT-1290: Issue when running Mahout Recommender Demo (Helder Garay Martins via smarthi)
-
- MAHOUT-1289: Move downsampling code into RowSimilarityJob (ssc)
-
- MAHOUT-1287: classifier.sgd.CsvRecordFactory incorrectly parses CSV format (Alex Franchuk via smarthi)
-
- MAHOUT-1285: Arff loader can misparse string data as double (smarthi)
-
- MAHOUT-1284: DummyRecordWriter's bug with reused Writables (Maysam Yabandeh via smarthi)
-
- MAHOUT-1275: Dropped bz2 distribution format for source and binaries (sslavic)
-
- MAHOUT-1265: Multilayer Perceptron (Yexi Jiang via smarthi)
-
- MAHOUT-1261: TasteHadoopUtils.idToIndex can return an int that has size Integer.MAX_VALUE (Carl Clark, smarthi)
-
- MAHOUT-1242: No key redistribution function for associative maps (Tharindu Rusira via smarthi)
-
- MAHOUT-1030: Regression: Clustered Points Should be WeightedPropertyVectorWritable not WeightedVectorWritable (Andrew Musselman, Pat Ferrel, Jeff Eastman, Lars Norskog, smarthi)
-
-Release 0.8 - 2013-07-25
-
- MAHOUT-1272: Parallel SGD matrix factorizer for SVDrecommender (Peng Cheng via ssc)
-
- MAHOUT-1271: classify-20newsgroups.sh fails during the seqdirectory step (smarthi)
-
- MAHOUT-1269: Cleanup deprecated Lucene 3.x API calls in lucene2seq utility unit tests (smarthi)
-
- MAHOUT-833: Make conversion to sequence files map-reduce (Josh Patterson, smarthi)
-
- MAHOUT-1268: Wrong output directory for CVB (Mark Wicks via ssc)
-
- MAHOUT-1264: Performance optimizations in RecommenderJob (ssc)
-
- MAHOUT-1262: Cleanup LDA code (ssc)
-
- MAHOUT-1255: Fix for weights in Multinomial sometimes overflowing in BallKMeans (dfilimon)
-
- MAHOUT-1254: Final round of cleanup for StreamingKMeans (dfilimon)
-
- MAHOUT-1263: Serialise/Deserialise Lambda value for OnlineLogisticRegression (Mike Davy via smarthi)
-
- MAHOUT-1258: Another shot at findbugs and checkstyle (ssc)
-
- MAHOUT-1253: Add experiment tools for StreamingKMeans, part 1 (dfilimon)
-
- MAHOUT-884: Matrix Concatenate Utility (Lance Norskog via smarthi)
-
- MAHOUT-1250: Deprecate unused algorithms (ssc)
-
- MAHOUT-1251: Optimize MinHashMapper (ssc)
-
- MAHOUT-1211: Disabled swallowing of IOExceptions is Closeables.close for writers (dfilimon)
-
- MAHOUT-1164: Make ARFF integration generate meta-data in JSON format (Marty Kube via ssc)
-
- MAHOUT-1163: Make random forest classifier meta-data file human readable (Marty Kube via ssc)
-
- MAHOUT-1243: Dictionary file format in Lucene-Mahout integration is not in SequenceFileFormat (ssc)
-
- MAHOUT-974: org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob use integer as userId and itemId (ssc)
-
- MAHOUT-1052: Add an option to MinHashDriver that specifies the dimension of vector to hash (indexes or values) (Elena Smirnova via smarthi)
-
- MAHOUT-1237: Total cluster cost isn't computed properly (dfilimon)
-
- MAHOUT-1196: LogisticModelParameters uses csv.getTargetCategories() even if csv is not used. (Vineet Krishnan via ssc)
-
- MAHOUT-1224: Add the option of running a StreamingKMeans pass in the Reducer before BallKMeans (dfilimon)
-
- MAHOUT-993: Some vector dumper flags are expecting arguments. (Andrew Look via robinanil)
-
- MAHOUT-1228: Cleanup .gitignore (Stevo Slavic via ssc)
-
- MAHOUT-1047: CVB hangs after completion (Angel Martinez Gonzalez via smarthi)
-
- MAHOUT-1235: ParallelALSFactorizationJob does not use VectorSumCombiner (ssc)
-
- MAHOUT-1230: SparceMatrix.clone() is not deep copy (Maysam Yabandeh via tdunning)
-
- MAHOUT-1232: VectorHelper.topEntries() throws a NPE when number of NonZero elements in vector < maxEntries (smarthi)
-
- MAHOUT-1229: Conf directory content from Mahout distribution archives cannot be unpacked (Stevo Slavic via smarthi)
-
- MAHOUT-1213: SSVD job doesn't clean its temp dir, and fails when seeing it again (smarthi)
-
- MAHOUT-1223: Fixed point skipped in StreamingKMeans when iterating through centroids from a reducer (dfilimon)
-
- MAHOUT-1222: Fix total weight in FastProjectionSearch (dfilimon)
-
- MAHOUT-1219: Remove LSHSearcher from StreamingKMeansTest. It causes it to sometimes fail (dfilimon)
-
- MAHOUT-1221: SparseMatrix.viewRow is sometimes readonly. (Maysam Yabandeh via smarthi)
-
- MAHOUT-1219: Remove LSHSearcher from SearchQualityTest. It causes it to fail, but the failure is not very meaningful (dfilimon)
-
- MAHOUT-1217: Nearest neighbor searchers sometimes fail to remove points: fix in FastProjectionSearch's searchFirst (dfilimon)
-
- MAHOUT-1216: Add locality sensitive hashing and a LocalitySensitiveHash searcher (dfilimon)
-
- MAHOUT-1181: Adding StreamingKMeans MapReduce classes (dfilimon)
-
- MAHOUT-1212: Incorrect classify-20newsgroups.sh file description (Julian Ortega via smarthi)
-
- MAHOUT-1209: DRY out maven-compiler-plugin configuration (Stevo Slavic via smarthi)
-
- MAHOUT-1207: Fix typos in description in parent pom (Stevo Slavic via smarthi)
-
- MAHOUT-1199: Improve javadoc comments of mahout-integration (Angel Martinez Gonzalez via smarthi)
-
- MAHOUT-1162: Adding BallKMeans and StreamingKMeans clustering algorithms (dfilimon)
-
- MAHOUT-1205: ParallelALSFactorizationJob should leverage the distributed cache (ssc)
-
- MAHOUT-1156: Adding nearest neighbor Searchers (dfilimon)
-
- MAHOUT-1202: Speed up Vector operations (dfilimon)
-
- MAHOUT-1155: Make MatrixSlice a Vector (and fix Centroid cloning; MAHOUT-1202) (dfilimon)
-
- MAHOUT-1189: CosineDistanceMeasure doesn't return 0 for two 0 vectors (dfilimon)
-
- MAHOUT-1180: Multinomial<T> throws ConcurrentModificationException when iterating and setting probabilities (dfilimon)
-
- MAHOUT-1192: Speed up Vector Operations (robinanil)
-
- MAHOUT-1191: Cleanup Vector Benchmarks make it less variable (robinanil)
-
- MAHOUT-1190: SequentialAccessSparseVector function assignment is very slow and other iterator woes (robinanil)
-
- MAHOUT-1188: Inconsistent reference to Lucene versions in code and POM (smarthi)
-
- MAHOUT-1161: Unable to run CJKAnalyzer for conversion of a sequence file to sparse vector due to instantiation exception (ssc)
-
- MAHOUT-1187: Update Commons Lang to Commons Lang3 (smarthi)
-
- MAHOUT-1184 Another take at pmd, findbugs and checkstyle (ssc)
-
- MAHOUT-1182: Remove useless append (Dave Brosius via tdunning)
-
- MAHOUT-1176: Introduce a changelog file to raise contributors' attribution (ssc)
-
- MAHOUT-1108: Allows cluster-reuters.sh example to be executed on a cluster (elmer.garduno via gsingers)
-
- MAHOUT-961: Fix issue in decision forest tree visualizer to properly show stems of tree (Ikumasa Mukai via gsingers)
-
- MAHOUT-944: Create SequenceFiles out of Lucene document storage (no term vectors required) (Frank Scholten, gsingers)
-
- MAHOUT-958: Fix issue with globs in RepresentativePointsDriver (Adam Baron, Vikram Dixit K, ehgjr via gsingers)
-
- MAHOUT-1084: Fixed issue with too many clusters in synthetic control example (liutengfei, gsingers)
-
- MAHOUT-1103: Fixed issue with splitting clusters on Hadoop (Matt Molek, gsingers)
-
- MAHOUT-1126: Filter out bad META-INF files in job packaging (Pat Ferrel, gsingers)
-
- MAHOUT-1211: Change deprecated Closeables.closeQuietly calls (smarthi, gsingers, srowen, dlyubimov)
http://git-wip-us.apache.org/repos/asf/mahout/blob/b5fe4aab/math-scala/src/main/scala/org/apache/mahout/math/cf/SimilarityAnalysis.scala
----------------------------------------------------------------------
diff --git a/math-scala/src/main/scala/org/apache/mahout/math/cf/SimilarityAnalysis.scala b/math-scala/src/main/scala/org/apache/mahout/math/cf/SimilarityAnalysis.scala
index 4632468..a10b942 100644
--- a/math-scala/src/main/scala/org/apache/mahout/math/cf/SimilarityAnalysis.scala
+++ b/math-scala/src/main/scala/org/apache/mahout/math/cf/SimilarityAnalysis.scala
@@ -44,23 +44,34 @@ object SimilarityAnalysis extends Serializable {
/** Compares (Int,Double) pairs by the second value */
private val orderByScore = Ordering.fromLessThan[(Int, Double)] { case ((_, score1), (_, score2)) => score1 > score2}
+ lazy val defaultParOpts = ParOpts()
+
/**
* Calculates item (column-wise) similarity using the log-likelihood ratio on A'A, A'B, A'C, ...
* and returns a list of similarity and cross-similarity matrices
- * @param drmARaw Primary interaction matrix
+ *
+ * @param drmARaw Primary interaction matrix
* @param randomSeed when kept to a constant will make repeatable downsampling
* @param maxInterestingItemsPerThing number of similar items to return per item, default: 50
* @param maxNumInteractions max number of interactions after downsampling, default: 500
+ * @param parOpts partitioning params for drm.par(...)
* @return a list of [[org.apache.mahout.math.drm.DrmLike]] containing downsampled DRMs for cooccurrence and
* cross-cooccurrence
*/
- def cooccurrences(drmARaw: DrmLike[Int], randomSeed: Int = 0xdeadbeef, maxInterestingItemsPerThing: Int = 50,
- maxNumInteractions: Int = 500, drmBs: Array[DrmLike[Int]] = Array()): List[DrmLike[Int]] = {
+ def cooccurrences(
+ drmARaw: DrmLike[Int],
+ randomSeed: Int = 0xdeadbeef,
+ maxInterestingItemsPerThing: Int = 50,
+ maxNumInteractions: Int = 500,
+ drmBs: Array[DrmLike[Int]] = Array(),
+ parOpts: ParOpts = defaultParOpts)
+ : List[DrmLike[Int]] = {
implicit val distributedContext = drmARaw.context
- // backend allowed to optimize partitioning
- drmARaw.par(auto = true)
+ // backend partitioning defaults to 'auto', which is often better decided by the calling function
+ // todo: this should ideally be different per drm
+ drmARaw.par(min = parOpts.minPar, exact = parOpts.exactPar, auto = parOpts.autoPar)
// Apply selective downsampling, pin resulting matrix
val drmA = sampleDownAndBinarize(drmARaw, randomSeed, maxNumInteractions)
@@ -82,8 +93,9 @@ object SimilarityAnalysis extends Serializable {
// Now look at cross cooccurrences
for (drmBRaw <- drmBs) {
- // backend allowed to optimize partitioning
- drmBRaw.par(auto = true)
+ // backend partitioning defaults to 'auto', which is often better decided by the calling function
+ // todo: this should ideally be different per drm
+ drmBRaw.par(min = parOpts.minPar, exact = parOpts.exactPar, auto = parOpts.autoPar)
// Down-sample and pin other interaction matrix
val drmB = sampleDownAndBinarize(drmBRaw, randomSeed, maxNumInteractions).checkpoint()
@@ -100,21 +112,11 @@ object SimilarityAnalysis extends Serializable {
similarityMatrices = similarityMatrices :+ drmSimilarityAtB
drmB.uncache()
-
- //debug
- val atbRows = drmSimilarityAtB.nrow
- val atbCols = drmSimilarityAtB.ncol
- val i = 0
}
// Unpin downsampled interaction matrix
drmA.uncache()
- //debug
- val ataRows = drmSimilarityAtA.nrow
- val ataCols = drmSimilarityAtA.ncol
- val i = 0
-
// Return list of similarity matrices
similarityMatrices
}
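As context for the similarity computation above, the log-likelihood ratio used to sparsify A'A and A'B can be sketched stand-alone from the 2x2 contingency counts, following Dunning's formulation. The helper names below are illustrative only, not Mahout's `LogLikelihood` API:

```scala
// Sketch of the LLR score used to sparsify A'A, A'B, ... (illustrative names).
def xLogX(x: Long): Double = if (x == 0L) 0.0 else x * math.log(x.toDouble)

// "entropy" of an unnormalized count distribution, in Dunning's sense
def entropy(counts: Long*): Double = xLogX(counts.sum) - counts.map(xLogX).sum

// k11 = co-occurrences, k12/k21 = occurrences of one without the other,
// k22 = neither; together they sum to the number of users
def logLikelihoodRatio(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
  val rowEntropy    = entropy(k11 + k12, k21 + k22)
  val columnEntropy = entropy(k11 + k21, k12 + k22)
  val matrixEntropy = entropy(k11, k12, k21, k22)
  // clamp tiny negatives caused by floating-point rounding
  math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy))
}
```

For example, counts (1, 1, 0, 2) over 4 users yield 1.7260924347106847 and (2, 0, 0, 2) yield 5.545177444479561 -- the control values that appear in the test suite.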
@@ -123,23 +125,27 @@ object SimilarityAnalysis extends Serializable {
* Calculates item (column-wise) similarity using the log-likelihood ratio on A'A, A'B, A'C, ... and returns
* a list of similarity and cross-similarity matrices. Somewhat easier to use method, which handles the ID
* dictionaries correctly
+ *
* @param indexedDatasets first in array is primary/A matrix all others are treated as secondary
* @param randomSeed use default to make repeatable, otherwise pass in system time or some randomizing seed
* @param maxInterestingItemsPerThing max similarities per items
* @param maxNumInteractions max number of input items per item
+ * @param parOpts partitioning params for drm.par(...)
* @return a list of [[org.apache.mahout.math.indexeddataset.IndexedDataset]] containing downsampled
* IndexedDatasets for cooccurrence and cross-cooccurrence
*/
- def cooccurrencesIDSs(indexedDatasets: Array[IndexedDataset],
- randomSeed: Int = 0xdeadbeef,
- maxInterestingItemsPerThing: Int = 50,
- maxNumInteractions: Int = 500):
+ def cooccurrencesIDSs(
+ indexedDatasets: Array[IndexedDataset],
+ randomSeed: Int = 0xdeadbeef,
+ maxInterestingItemsPerThing: Int = 50,
+ maxNumInteractions: Int = 500,
+ parOpts: ParOpts = defaultParOpts):
List[IndexedDataset] = {
val drms = indexedDatasets.map(_.matrix.asInstanceOf[DrmLike[Int]])
val primaryDrm = drms(0)
val secondaryDrms = drms.drop(1)
val coocMatrices = cooccurrences(primaryDrm, randomSeed, maxInterestingItemsPerThing,
- maxNumInteractions, secondaryDrms)
+ maxNumInteractions, secondaryDrms, parOpts)
val retIDSs = coocMatrices.iterator.zipWithIndex.map {
case( drm, i ) =>
indexedDatasets(0).create(drm, indexedDatasets(0).columnIDs, indexedDatasets(i).columnIDs)
@@ -148,19 +154,110 @@ object SimilarityAnalysis extends Serializable {
}
/**
+ * Calculates item (column-wise) similarity using the log-likelihood ratio on A'A, A'B, A'C, ... and returns
+ * a list of similarity and cross-occurrence matrices. Somewhat easier to use method, which handles the ID
+ * dictionaries correctly and contains info about downsampling in each model calc.
+ *
+ * @param datasets first in array is primary/A matrix all others are treated as secondary, includes information
+ * used to downsample the input drm as well as the output llr(A'A), llr(A'B). The information
+ * is contained in each dataset in the array and applies to the model calculation of A' with
+ * the dataset. Todo: ignoring absolute threshold for now.
+ * @param randomSeed use default to make repeatable, otherwise pass in system time or some randomizing seed
+ * @param parOpts partitioning params for drm.par(...)
+ * @return a list of [[org.apache.mahout.math.indexeddataset.IndexedDataset]] containing downsampled
+ * IndexedDatasets for cooccurrence and cross-cooccurrence
+ */
+ def crossOccurrenceDownsampled(
+ datasets: List[DownsamplableCrossOccurrenceDataset],
+ randomSeed: Int = 0xdeadbeef):
+ List[IndexedDataset] = {
+
+
+ val crossDatasets = datasets.drop(1) // drop A
+ val primaryDataset = datasets.head // use A throughout
+ val drmARaw = primaryDataset.iD.matrix
+
+ implicit val distributedContext = primaryDataset.iD.matrix.context
+
+ // backend partitioning defaults to 'auto', which is often better decided by the calling function
+ val parOptsA = primaryDataset.parOpts.getOrElse(defaultParOpts)
+ drmARaw.par(min = parOptsA.minPar, exact = parOptsA.exactPar, auto = parOptsA.autoPar)
+
+ // Apply selective downsampling, pin resulting matrix
+ val drmA = sampleDownAndBinarize(drmARaw, randomSeed, primaryDataset.maxElementsPerRow)
+
+ // num users, which equals the maximum number of interactions per item
+ val numUsers = drmA.nrow.toInt
+
+ // Compute & broadcast the number of interactions per thing in A
+ val bcastInteractionsPerItemA = drmBroadcast(drmA.numNonZeroElementsPerColumn)
+
+ // Compute cooccurrence matrix A'A
+ val drmAtA = drmA.t %*% drmA
+
+ // Compute loglikelihood scores and sparsify the resulting matrix to get the similarity matrix
+ val drmSimilarityAtA = computeSimilarities(drmAtA, numUsers, primaryDataset.maxInterestingElements,
+ bcastInteractionsPerItemA, bcastInteractionsPerItemA, crossCooccurrence = false,
+ minLLROpt = primaryDataset.minLLROpt)
+
+ var similarityMatrices = List(drmSimilarityAtA)
+
+ // Now look at cross cooccurrences
+ for (dataset <- crossDatasets) {
+ // backend partitioning defaults to 'auto', which is often better decided by the calling function
+ val parOptsB = dataset.parOpts.getOrElse(defaultParOpts)
+ dataset.iD.matrix.par(min = parOptsB.minPar, exact = parOptsB.exactPar, auto = parOptsB.autoPar)
+
+ // Downsample and pin other interaction matrix
+ val drmB = sampleDownAndBinarize(dataset.iD.matrix, randomSeed, dataset.maxElementsPerRow).checkpoint()
+
+ // Compute & broadcast the number of interactions per thing in B
+ val bcastInteractionsPerThingB = drmBroadcast(drmB.numNonZeroElementsPerColumn)
+
+ // Compute cross-cooccurrence matrix A'B
+ val drmAtB = drmA.t %*% drmB
+
+ val drmSimilarityAtB = computeSimilarities(drmAtB, numUsers, dataset.maxInterestingElements,
+ bcastInteractionsPerItemA, bcastInteractionsPerThingB, minLLROpt = dataset.minLLROpt)
+
+ similarityMatrices = similarityMatrices :+ drmSimilarityAtB
+
+ drmB.uncache()
+ }
+
+ // Unpin downsampled interaction matrix
+ drmA.uncache()
+
+ // Return list of datasets
+ val retIDSs = similarityMatrices.iterator.zipWithIndex.map {
+ case( drm, i ) =>
+ datasets(0).iD.create(drm, datasets(0).iD.columnIDs, datasets(i).iD.columnIDs)
+ }
+ retIDSs.toList
+
+ }
+
+ /**
* Calculates row-wise similarity using the log-likelihood ratio on AA' and returns a DRM of rows and similar rows
+ *
* @param drmARaw Primary interaction matrix
* @param randomSeed when kept to a constant will make repeatable downsampling
* @param maxInterestingSimilaritiesPerRow number of similar items to return per item, default: 50
* @param maxNumInteractions max number of interactions after downsampling, default: 500
+ * @param parOpts partitioning options used for drm.par(...)
*/
- def rowSimilarity(drmARaw: DrmLike[Int], randomSeed: Int = 0xdeadbeef, maxInterestingSimilaritiesPerRow: Int = 50,
- maxNumInteractions: Int = 500): DrmLike[Int] = {
+ def rowSimilarity(
+ drmARaw: DrmLike[Int],
+ randomSeed: Int = 0xdeadbeef,
+ maxInterestingSimilaritiesPerRow: Int = 50,
+ maxNumInteractions: Int = 500,
+ parOpts: ParOpts = defaultParOpts): DrmLike[Int] = {
implicit val distributedContext = drmARaw.context
- // backend allowed to optimize partitioning
- drmARaw.par(auto = true)
+ // backend partitioning defaults to 'auto', which is often better decided by the calling function
+ // todo: should this ideally be different per drm?
+ drmARaw.par(min = parOpts.minPar, exact = parOpts.exactPar, auto = parOpts.autoPar)
// Apply selective downsampling, pin resulting matrix
val drmA = sampleDownAndBinarize(drmARaw, randomSeed, maxNumInteractions)
@@ -184,6 +281,7 @@ object SimilarityAnalysis extends Serializable {
/**
* Calculates row-wise similarity using the log-likelihood ratio on AA' and returns a drm of rows and similar rows.
* Uses IndexedDatasets, which handle external ID dictionaries properly
+ *
* @param indexedDataset compare each row to every other
* @param randomSeed use default to make repeatable, otherwise pass in system time or some randomizing seed
* @param maxInterestingSimilaritiesPerRow max elements returned in each row
@@ -211,9 +309,17 @@ object SimilarityAnalysis extends Serializable {
}
- def computeSimilarities(drm: DrmLike[Int], numUsers: Int, maxInterestingItemsPerThing: Int,
- bcastNumInteractionsB: BCast[Vector], bcastNumInteractionsA: BCast[Vector],
- crossCooccurrence: Boolean = true) = {
+ def computeSimilarities(
+ drm: DrmLike[Int],
+ numUsers: Int,
+ maxInterestingItemsPerThing: Int,
+ bcastNumInteractionsB: BCast[Vector],
+ bcastNumInteractionsA: BCast[Vector],
+ crossCooccurrence: Boolean = true,
+ minLLROpt: Option[Double] = None) = {
+
+ val minLLR = minLLROpt.getOrElse(0.0d) // accept all values if not specified
+
drm.mapBlock() {
case (keys, block) =>
@@ -245,11 +351,13 @@ object SimilarityAnalysis extends Serializable {
// val candidate = thingA -> normalizedLLR
// Enqueue item with score, if belonging to the top-k
- if (topItemsPerThing.size < maxInterestingItemsPerThing) {
- topItemsPerThing.enqueue(candidate)
- } else if (orderByScore.lt(candidate, topItemsPerThing.head)) {
- topItemsPerThing.dequeue()
- topItemsPerThing.enqueue(candidate)
+ if (candidate._2 >= minLLR) { // llr threshold takes precedence over max per row
+ if (topItemsPerThing.size < maxInterestingItemsPerThing) {
+ topItemsPerThing.enqueue(candidate)
+ } else if (orderByScore.lt(candidate, topItemsPerThing.head)) {
+ topItemsPerThing.dequeue()
+ topItemsPerThing.enqueue(candidate)
+ }
}
}
}
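The gating added in this hunk can be sketched stand-alone (plain Scala collections, assumed here in place of the Mahout `mapBlock` context): candidates below `minLLR` are dropped before the top-k queue is even consulted, so the threshold takes precedence over the per-row cap.

```scala
import scala.collection.mutable

// Keep at most k (item, llr) candidates, all at or above minLLR.
def topKAboveThreshold(
    candidates: Seq[(Int, Double)],
    k: Int,
    minLLR: Double = 0.0): Seq[(Int, Double)] = {
  // head of the queue is the lowest-scoring element kept so far
  val lowestFirst = Ordering.by[(Int, Double), Double](_._2).reverse
  val topItems = mutable.PriorityQueue.empty[(Int, Double)](lowestFirst)
  for (candidate <- candidates) {
    if (candidate._2 >= minLLR) {       // threshold takes precedence over top-k
      if (topItems.size < k) {
        topItems.enqueue(candidate)
      } else if (candidate._2 > topItems.head._2) {
        topItems.dequeue()              // evict the current lowest score
        topItems.enqueue(candidate)
      }
    }
  }
  topItems.dequeueAll.reverse.toSeq     // highest score first
}
```

With a high enough threshold every candidate is rejected, which is exactly the behavior the "LLR threshold" test below relies on.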
@@ -270,6 +378,7 @@ object SimilarityAnalysis extends Serializable {
* https://github.com/tdunning/in-memory-cooccurrence/blob/master/src/main/java/com/tdunning/cooc/Analyze.java
*
* additionally binarizes input matrix, as we're only interested in knowing whether interactions happened or not
+ *
* @param drmM matrix to downsample
* @param seed random number generator seed, keep to a constant if repeatability is necessary
* @param maxNumInteractions number of elements in a row of the returned matrix
@@ -325,3 +434,18 @@ object SimilarityAnalysis extends Serializable {
downSampledDrmI
}
}
+
+case class ParOpts( // defaults mirror the `par` defaults, except auto = true
+ minPar: Int = -1,
+ exactPar: Int = -1,
+ autoPar: Boolean = true)
+
+/* Used to pass in data and params for downsampling the input data as well as output A'A, A'B, etc. */
+case class DownsamplableCrossOccurrenceDataset(
+ iD: IndexedDataset,
+ maxElementsPerRow: Int = 500, // usually items per user in the input dataset, used to randomly downsample
+ maxInterestingElements: Int = 50, // number of items/columns to keep in the A'A, A'B etc. where iD == A, B, C ...
+ minLLROpt: Option[Double] = None, // absolute threshold, takes precedence over maxInterestingElements if present
+ parOpts: Option[ParOpts] = None) // these can be set per dataset and are applied to each of the drms
+ // in crossOccurrenceDownsampled
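To illustrate the role of `maxElementsPerRow`, here is a hypothetical single-row sketch (not Mahout's `sampleDownAndBinarize`): binarize the row, and when it has more nonzeros than the cap, keep a seeded random subset so the downsampling is repeatable.

```scala
import scala.util.Random

// Binarize one row; if it has more than maxElementsPerRow nonzeros,
// keep a seeded random subset of that size (names are illustrative).
def downsampleAndBinarizeRow(
    row: Array[Double],
    maxElementsPerRow: Int,
    seed: Long = 0xdeadbeefL): Array[Double] = {
  val nonZeros = row.indices.filter(row(_) != 0.0)
  val kept: Set[Int] =
    if (nonZeros.size <= maxElementsPerRow) nonZeros.toSet
    else new Random(seed).shuffle(nonZeros.toList).take(maxElementsPerRow).toSet
  Array.tabulate(row.length)(i => if (kept(i)) 1.0 else 0.0)
}
```

Keeping the seed constant, as the `randomSeed` default does above, makes repeated runs produce the same sample.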
+
http://git-wip-us.apache.org/repos/asf/mahout/blob/b5fe4aab/spark/src/test/scala/org/apache/mahout/cf/SimilarityAnalysisSuite.scala
----------------------------------------------------------------------
diff --git a/spark/src/test/scala/org/apache/mahout/cf/SimilarityAnalysisSuite.scala b/spark/src/test/scala/org/apache/mahout/cf/SimilarityAnalysisSuite.scala
index 0b3b3eb..63e0df7 100644
--- a/spark/src/test/scala/org/apache/mahout/cf/SimilarityAnalysisSuite.scala
+++ b/spark/src/test/scala/org/apache/mahout/cf/SimilarityAnalysisSuite.scala
@@ -17,9 +17,11 @@
package org.apache.mahout.cf
-import org.apache.mahout.math.cf.SimilarityAnalysis
+import org.apache.mahout.math.cf.{DownsamplableCrossOccurrenceDataset, SimilarityAnalysis}
import org.apache.mahout.math.drm._
+import org.apache.mahout.math.indexeddataset.BiDictionary
import org.apache.mahout.math.scalabindings.{MatrixOps, _}
+import org.apache.mahout.sparkbindings.indexeddataset.IndexedDatasetSpark
import org.apache.mahout.sparkbindings.test.DistributedSparkSuite
import org.apache.mahout.test.MahoutSuite
import org.scalatest.FunSuite
@@ -58,7 +60,7 @@ class SimilarityAnalysisSuite extends FunSuite with MahoutSuite with Distributed
(1.7260924347106847, 0.6795961471815897, 0.6795961471815897, 1.7260924347106847, 0.0),
(0.0, 0.0, 0.0, 0.0, 4.498681156950466))
- final val matrixLLRCoocBtAControl = dense(
+ final val matrixLLRCoocAtBControl = dense(
(1.7260924347106847, 1.7260924347106847, 1.7260924347106847, 1.7260924347106847, 0.0),
(0.6795961471815897, 0.6795961471815897, 0.6795961471815897, 0.6795961471815897, 0.0),
(0.6795961471815897, 0.6795961471815897, 0.6795961471815897, 0.6795961471815897, 0.0),
@@ -66,7 +68,7 @@ class SimilarityAnalysisSuite extends FunSuite with MahoutSuite with Distributed
(0.0, 0.0, 0.6795961471815897, 0.0, 4.498681156950466))
- test("cooccurrence [A'A], [B'A] boolbean data using LLR") {
+ test("Cross-occurrence [A'A], [B'A] boolean data using LLR") {
val a = dense(
(1, 1, 0, 0, 0),
(0, 0, 1, 1, 0),
@@ -91,13 +93,13 @@ class SimilarityAnalysisSuite extends FunSuite with MahoutSuite with Distributed
//cross similarity
val matrixCrossCooc = drmCooc(1).checkpoint().collect
- val diff2Matrix = matrixCrossCooc.minus(matrixLLRCoocBtAControl)
+ val diff2Matrix = matrixCrossCooc.minus(matrixLLRCoocAtBControl)
n = (new MatrixOps(m = diff2Matrix)).norm
n should be < 1E-10
}
- test("cooccurrence [A'A], [B'A] double data using LLR") {
+ test("Cross-occurrence [A'A], [B'A] double data using LLR") {
val a = dense(
(100000.0D, 1.0D, 0.0D, 0.0D, 0.0D),
( 0.0D, 0.0D, 10.0D, 1.0D, 0.0D),
@@ -122,12 +124,12 @@ class SimilarityAnalysisSuite extends FunSuite with MahoutSuite with Distributed
//cross similarity
val matrixCrossCooc = drmCooc(1).checkpoint().collect
- val diff2Matrix = matrixCrossCooc.minus(matrixLLRCoocBtAControl)
+ val diff2Matrix = matrixCrossCooc.minus(matrixLLRCoocAtBControl)
n = (new MatrixOps(m = diff2Matrix)).norm
n should be < 1E-10
}
- test("cooccurrence [A'A], [B'A] integer data using LLR") {
+ test("Cross-occurrence [A'A], [B'A] integer data using LLR") {
val a = dense(
( 1000, 10, 0, 0, 0),
( 0, 0, -10000, 10, 0),
@@ -154,12 +156,12 @@ class SimilarityAnalysisSuite extends FunSuite with MahoutSuite with Distributed
//cross similarity
val matrixCrossCooc = drmCooc(1).checkpoint().collect
- val diff2Matrix = matrixCrossCooc.minus(matrixLLRCoocBtAControl)
+ val diff2Matrix = matrixCrossCooc.minus(matrixLLRCoocAtBControl)
n = (new MatrixOps(m = diff2Matrix)).norm
n should be < 1E-10
}
- test("cooccurrence two matrices with different number of columns"){
+ test("Cross-occurrence two matrices with different number of columns"){
val a = dense(
(1, 1, 0, 0, 0),
(0, 0, 1, 1, 0),
@@ -172,7 +174,7 @@ class SimilarityAnalysisSuite extends FunSuite with MahoutSuite with Distributed
(0, 0, 1, 0),
(1, 1, 0, 1))
- val matrixLLRCoocBtANonSymmetric = dense(
+ val matrixLLRCoocAtBNonSymmetric = dense(
(0.0, 1.7260924347106847, 1.7260924347106847, 1.7260924347106847),
(0.0, 0.6795961471815897, 0.6795961471815897, 0.0),
(1.7260924347106847, 0.6795961471815897, 0.6795961471815897, 0.0),
@@ -191,7 +193,7 @@ class SimilarityAnalysisSuite extends FunSuite with MahoutSuite with Distributed
//cross similarity
val matrixCrossCooc = drmCooc(1).checkpoint().collect
- val diff2Matrix = matrixCrossCooc.minus(matrixLLRCoocBtANonSymmetric)
+ val diff2Matrix = matrixCrossCooc.minus(matrixLLRCoocAtBNonSymmetric)
n = (new MatrixOps(m = diff2Matrix)).norm
//cooccurrence without LLR is just a A'B
@@ -199,6 +201,107 @@ class SimilarityAnalysisSuite extends FunSuite with MahoutSuite with Distributed
//val bp = 0
}
+ test("Cross-occurrence two IndexedDatasets"){
+ val a = dense(
+ (1, 1, 0, 0, 0),
+ (0, 0, 1, 1, 0),
+ (0, 0, 0, 0, 1),
+ (1, 0, 0, 1, 0))
+
+ val b = dense(
+ (0, 1, 1, 0),
+ (1, 1, 1, 0),
+ (0, 0, 1, 0),
+ (1, 1, 0, 1))
+
+ val users = Seq("u1", "u2", "u3", "u4")
+ val itemsA = Seq("a1", "a2", "a3", "a4", "a5")
+ val itemsB = Seq("b1", "b2", "b3", "b4")
+ val userDict = new BiDictionary(users)
+ val itemsADict = new BiDictionary(itemsA)
+ val itemsBDict = new BiDictionary(itemsB)
+
+ // this is downsampled to the top 2 values per row to match the calc
+ val matrixLLRCoocAtBNonSymmetric = dense(
+ (0.0, 1.7260924347106847, 1.7260924347106847, 0.0),
+ (0.0, 0.6795961471815897, 0.6795961471815897, 0.0),
+ (1.7260924347106847, 0.6795961471815897, 0.0, 0.0),
+ (5.545177444479561, 1.7260924347106847, 0.0, 0.0),
+ (0.0, 0.0, 0.6795961471815897, 0.0))
+
+ val drmA = drmParallelize(m = a, numPartitions = 2)
+ val drmB = drmParallelize(m = b, numPartitions = 2)
+
+ val aID = new IndexedDatasetSpark(drmA, userDict, itemsADict)
+ val bID = new IndexedDatasetSpark(drmB, userDict, itemsBDict)
+ val aD = DownsamplableCrossOccurrenceDataset(aID)
+ val bD = DownsamplableCrossOccurrenceDataset(bID, maxInterestingElements = 2)
+
+ //self similarity
+ val drmCooc = SimilarityAnalysis.crossOccurrenceDownsampled(List(aD, bD))
+ val matrixSelfCooc = drmCooc(0).matrix.checkpoint().collect
+ val diffMatrix = matrixSelfCooc.minus(matrixLLRCoocAtAControl)
+ var n = (new MatrixOps(m = diffMatrix)).norm
+ n should be < 1E-10
+
+ //cross similarity
+ val matrixCrossCooc = drmCooc(1).matrix.checkpoint().collect
+ val diff2Matrix = matrixCrossCooc.minus(matrixLLRCoocAtBNonSymmetric)
+ n = (new MatrixOps(m = diff2Matrix)).norm
+ n should be < 1E-10
+ }
+
+ test("Cross-occurrence two IndexedDatasets LLR threshold"){
+ val a = dense(
+ (1, 1, 0, 0, 0),
+ (0, 0, 1, 1, 0),
+ (0, 0, 0, 0, 1),
+ (1, 0, 0, 1, 0))
+
+ val b = dense(
+ (0, 1, 1, 0),
+ (1, 1, 1, 0),
+ (0, 0, 1, 0),
+ (1, 1, 0, 1))
+
+ val users = Seq("u1", "u2", "u3", "u4")
+ val itemsA = Seq("a1", "a2", "a3", "a4", "a5")
+ val itemsB = Seq("b1", "b2", "b3", "b4")
+ val userDict = new BiDictionary(users)
+ val itemsADict = new BiDictionary(itemsA)
+ val itemsBDict = new BiDictionary(itemsB)
+
+ // this is downsampled to the top 2 values per row to match the calc but also uses a min llr threshold so
+ // the # per row is still applied but nothing gets past the min llr check
+ val matrixLLRCoocAtBNonSymmetric = dense(
+ (0.0, 1.7260924347106847, 1.7260924347106847, 0.0),
+ (0.0, 0.0, 0.0, 0.0),
+ (1.7260924347106847, 0.0, 0.0, 0.0),
+ (5.545177444479561, 1.7260924347106847, 0.0, 0.0),
+ (0.0, 0.0, 0.0, 0.0))
+
+ val drmA = drmParallelize(m = a, numPartitions = 2)
+ val drmB = drmParallelize(m = b, numPartitions = 2)
+
+ val aID = new IndexedDatasetSpark(drmA, userDict, itemsADict)
+ val bID = new IndexedDatasetSpark(drmB, userDict, itemsBDict)
+ val aD = DownsamplableCrossOccurrenceDataset(aID)
+ val bD = DownsamplableCrossOccurrenceDataset(bID, minLLROpt = Some(1.7), maxInterestingElements = 2)
+
+ //self similarity
+ val drmCooc = SimilarityAnalysis.crossOccurrenceDownsampled(List(aD, bD))
+ val matrixSelfCooc = drmCooc(0).matrix.checkpoint().collect
+ val diffMatrix = matrixSelfCooc.minus(matrixLLRCoocAtAControl)
+ var n = (new MatrixOps(m = diffMatrix)).norm
+ n should be < 1E-10
+
+ //cross similarity
+ val matrixCrossCooc = drmCooc(1).matrix.checkpoint().collect
+ val diff2Matrix = matrixCrossCooc.minus(matrixLLRCoocAtBNonSymmetric)
+ n = (new MatrixOps(m = diff2Matrix)).norm
+ n should be < 1E-10
+ }
+
test("LLR calc") {
val A = dense(
(1, 1, 0, 0, 0),