Posted to dev@mahout.apache.org by "Suneel Marthi (JIRA)" <ji...@apache.org> on 2013/12/02 02:58:35 UTC

[jira] [Closed] (MAHOUT-1154) Implementing Streaming KMeans

     [ https://issues.apache.org/jira/browse/MAHOUT-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Suneel Marthi closed MAHOUT-1154.
---------------------------------


> Implementing Streaming KMeans
> -----------------------------
>
>                 Key: MAHOUT-1154
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1154
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>    Affects Versions: 0.8
>            Reporter: Dan Filimon
>            Assignee: Dan Filimon
>             Fix For: 0.8
>
>
> An implementation of Streaming KMeans as mentioned in [1] is available here [2].
> [1] http://mail-archives.apache.org/mod_mbox/mahout-dev/201303.mbox/%3CCAOwb3gOyf9zufrgXHsucpkJXk6cW0Nnr8GwG__JSey+kVABeyg@mail.gmail.com%3E
> [2] https://github.com/dfilimon/mahout
> Since there will be more than one patch, separate JIRA issues will address each one.
> The description of the code being added is:
> The main classes are in o.a.m.clustering.streaming [3], under the
> core/ project. These are subdivided into two packages:
> - cluster: contains the BallKMeans and StreamingKMeans classes that
> can be used standalone.
>   BallKMeans is what it sounds like: it uses k-means++ for
> initialization, then does a normal k-means pass while ignoring
> outliers.
>   StreamingKMeans implements online clustering that does not return
> exactly k clusters (it returns an estimate). Its output is used to
> approximate the data.
> - mapreduce: contains the CentroidWritable, StreamingKMeansDriver,
> StreamingKMeansMapper and StreamingKMeansReducer classes.
>   CentroidWritable serializes Centroids (sort of like AbstractCluster).
>   StreamingKMeansDriver provides the driver for the job.
>   StreamingKMeansMapper runs StreamingKMeans in the mappers to produce
> sketches of the data for the reducer.
>   StreamingKMeansReducer collects the centroids produced by the
> mappers into one set of weighted points and runs BallKMeans on them
> producing the final results.
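
The mapper/reducer split described above can be illustrated with a small, self-contained Java sketch. It is not Mahout's code. The names (StreamingSketchDemo, WeightedCentroid, sketch, collect) are hypothetical, the fixed distanceCutoff and the "new centroid with probability distance / cutoff" rule are simplifying assumptions, and the final BallKMeans pass over the pooled centroids is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; none of these names come from Mahout.
final class StreamingSketchDemo {

    /** A weighted centroid: the running mean of every point merged into it. */
    static final class WeightedCentroid {
        final double[] mean;
        double weight = 1.0;

        WeightedCentroid(double[] point) {
            this.mean = point.clone();
        }

        /** Merges one more point into the running mean. */
        void absorb(double[] point) {
            weight += 1.0;
            for (int i = 0; i < mean.length; i++) {
                mean[i] += (point[i] - mean[i]) / weight;
            }
        }
    }

    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /**
     * "Mapper" step: a single pass over one split. Each point either starts a
     * new centroid (always for the first point, otherwise with probability
     * distance / distanceCutoff) or is merged into its nearest centroid.
     * The number of centroids is therefore not fixed in advance.
     */
    static List<WeightedCentroid> sketch(List<double[]> split, double distanceCutoff, Random rng) {
        List<WeightedCentroid> centroids = new ArrayList<>();
        for (double[] point : split) {
            WeightedCentroid nearest = null;
            double best = Double.POSITIVE_INFINITY;
            for (WeightedCentroid c : centroids) {
                double d = distance(c.mean, point);
                if (d < best) {
                    best = d;
                    nearest = c;
                }
            }
            if (nearest == null || rng.nextDouble() < best / distanceCutoff) {
                centroids.add(new WeightedCentroid(point));
            } else {
                nearest.absorb(point);
            }
        }
        return centroids;
    }

    /** "Reducer" step, first half: pool all mapper sketches into one weighted point set. */
    static List<WeightedCentroid> collect(List<List<WeightedCentroid>> sketches) {
        List<WeightedCentroid> pooled = new ArrayList<>();
        for (List<WeightedCentroid> s : sketches) {
            pooled.addAll(s);
        }
        return pooled;
    }
}
```

In this sketch, each call to sketch stands in for one mapper, and collect stands in for the part of the reducer that gathers the weighted centroids. The total weight of the pooled centroids equals the number of input points, which is what lets a later weighted clustering pass treat them as a compressed copy of the data.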
> Additionally, the searchers are in o.a.m.math.neighborhood:
> - neighborhood: various searcher classes that implement nearest-neighbor
> search using different strategies.
>   Searcher, UpdatableSearcher: abstract classes that define how to
> search through collections of vectors.
>   BruteSearch: does a brute-force search (looks at every point).
>   ProjectionSearch: uses random projections for searching.
>   FastProjectionSearch: also uses random projections (but not binary
> search trees as in ProjectionSearch).
>   HashedVector, LocalitySensitiveHashSearch: implement locality
> sensitive hash search.
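
As a point of reference for the searchers listed above, a brute-force nearest-neighbor search can be sketched in a few lines of Java. This is an illustration of the idea only, not the BruteSearch class itself: the class name BruteSearchSketch, the method names, and the use of plain double[] vectors with squared Euclidean distance are all assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; this is not Mahout's BruteSearch API.
final class BruteSearchSketch {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /**
     * Returns the indices of the k reference vectors closest to the query,
     * nearest first. Every reference vector is examined, so the cost grows
     * linearly with the size of the reference set (plus the sort).
     */
    static List<Integer> search(List<double[]> reference, double[] query, int k) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < reference.size(); i++) {
            indices.add(i);
        }
        indices.sort(Comparator.comparingDouble(i -> squaredDistance(reference.get(i), query)));
        return new ArrayList<>(indices.subList(0, Math.min(k, indices.size())));
    }
}
```

The projection-based and locality-sensitive-hash searchers exist to avoid exactly this full scan, trading some accuracy for speed. A brute-force version like this one is mainly useful as a correctness baseline when comparing them.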
> All the tools that I used are in o.a.m.clustering.streaming [4], under
> the examples/ project.
> There are a bunch of classes here, covering everything from
> vectorizing 20 newsgroups data to various IO utils. The more important
> ones are:
>   utils.ExperimentUtils: convenience methods.
>   tools.ClusterQuality20NewsGroups: actual experiment, with hardcoded paths.
> [3] https://github.com/dfilimon/mahout/tree/skm/core/src/main/java/org/apache/mahout/clustering/streaming
> [4] https://github.com/dfilimon/mahout/tree/skm/examples/src/main/java/org/apache/mahout/clustering/streaming
> The relevant issues are:
> - MAHOUT-1155 (Centroid, WeightedVector)
> - MAHOUT-1156 (searchers)
> - MAHOUT-1162 (clustering, non map-reduce)
> - MAHOUT-1181 (map-reduce, command-line changes, pom.xml)


