You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mahout.apache.org by "Suneel Marthi (JIRA)" <ji...@apache.org> on 2013/12/02 02:58:35 UTC
[jira] [Closed] (MAHOUT-1154) Implementing Streaming KMeans
[ https://issues.apache.org/jira/browse/MAHOUT-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Suneel Marthi closed MAHOUT-1154.
---------------------------------
> Implementing Streaming KMeans
> -----------------------------
>
> Key: MAHOUT-1154
> URL: https://issues.apache.org/jira/browse/MAHOUT-1154
> Project: Mahout
> Issue Type: New Feature
> Components: Clustering
> Affects Versions: 0.8
> Reporter: Dan Filimon
> Assignee: Dan Filimon
> Fix For: 0.8
>
>
> An implementation of Streaming KMeans as mentioned in [1] is available here [2].
> [1]http://mail-archives.apache.org/mod_mbox/mahout-dev/201303.mbox/%3CCAOwb3gOyf9zufrgXHsucpkJXk6cW0Nnr8GwG__JSey+kVABeyg@mail.gmail.com%3E
> [2] https://github.com/dfilimon/mahout
> Since there will be more than one patches, there will be specific JIRA issues that address each one.
> The description of the code being added is:
> The main classes are in o.a.m.clustering.streaming [1], under the
> core/ project. These are subdivided into 2 packages:
> - cluster: contains the BallKMeans and StreamingKMeans classes that
> can be used standalone.
> BallKMeans is exactly what it sounds like (uses k-means++ for the
> initialization, then does a normal k-means pass and ignoring
> outilers).
> StreamingKMeans implements the online clustering that doesn't return
> exactly k clusters, (it returns an estimate). This is used to
> approximate the data.
> - mapreduce: contains the CentroidWritable, StreamingKMeansDriver,
> StreamingKMeansMapper and StreamingKMeansReducer classes.
> CentroidWritable serializes Centroids (sort of like AbstractCluster).
> StreamingKMeansDriver provides the driver for the job.
> StreamingKMeansMapper runs StreamingKMeans in the mappers to produce
> sketches of the data for the reducer.
> StreamingKMeansReducer collects the centroids produced by the
> mappers into one set of weighted points and runs BallKMeans on them
> producing the final results.
> Additionally the searchers are in o.a.m.math.neighborhood
> - neighborhood: various searcher classes that implement nearest-neighbor
> search using different strategies.
> Searcher, UpdatableSearcher: abstract classes that define how to
> search through collections of vectors.
> BruteSearch: does a brute search (looks at every point...)
> ProjectionSearch: uses random projections for searching.
> FastProjectionSearch: also uses random projections (but not binary
> search trees as in ProjectionSearch).
> HashedVector, LocalitySensitiveHashSearch: implement locality
> sensitive hash search.
> All the tools that I used are in o.a.m.clustering.streaming [2], under
> the examples/ project.
> There are a bunch of classes here, covering everything from
> vectorizing 20 newsgroups data to various IO utils. The more important
> ones are:
> utils.ExperimentUtils: convenience methods.
> tools.ClusterQuality20NewsGroups: actual experiment, with hardcoded paths.
> [3] https://github.com/dfilimon/mahout/tree/skm/core/src/main/java/org/apache/mahout/clustering/streaming
> [4] https://github.com/dfilimon/mahout/tree/skm/examples/src/main/java/org/apache/mahout/clustering/streaming
> The relevant issues are:
> - MAHOUT-1155 (Centroid, WeightedVector)
> - MAHOUT-1156 (searchers)
> - MAHOUT-1162 (clustering, non map-reduce)
> - MAHOUT-1181 (map-reduce, command-line changes, pom.xml)
--
This message was sent by Atlassian JIRA
(v6.1#6144)