Posted to user@spark.apache.org by Debasish Ghosh <gh...@gmail.com> on 2016/11/17 14:03:10 UTC

outlier detection using StreamingKMeans

Hello -

I am trying to implement an outlier detection application on streaming
data. I am a newbie to Spark and hence would like some advice on a few
points that are confusing me.

I am thinking of using StreamingKMeans - is this a good choice? I have
one stream of data and I need an online algorithm. For context, here is
roughly how I plan to set things up (a minimal, untested sketch; the
feature dimension, the value of k, and the parsing of the Kafka records
are all placeholders):
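  import org.apache.spark.mllib.clustering.StreamingKMeans
  import org.apache.spark.mllib.linalg.{Vector, Vectors}
  import org.apache.spark.streaming.dstream.DStream

  // kafkaLines is assumed to be a DStream[String] obtained via
  // KafkaUtils (ingestion elided); each record is assumed to be a
  // comma-separated feature vector
  val points: DStream[Vector] =
    kafkaLines.map(line => Vectors.dense(line.split(',').map(_.toDouble)))

  val model = new StreamingKMeans()
    .setK(5)                    // placeholder; see question 2 below
    .setDecayFactor(1.0)        // 1.0 = all past batches weighted equally
    .setRandomCenters(10, 0.0)  // dim = 10 is a placeholder

  model.trainOn(points)         // updates the cluster centers per microbatch

Given that setup, here are the questions that immediately come to my mind: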

   1. I cannot do separate training, cross-validation etc. Is it a good
   idea to do training and prediction online?
   2. The data will be read from the stream coming from Kafka in
   microbatches of (say) 3 seconds. I get a DStream on which I train and
   get the clusters. How can I decide on the number of clusters? With
   StreamingKMeans, is there any way I can iterate on microbatches with
   different values of k to find the optimal one?
   3. Even if I fix k, after training on every microbatch I get a DStream.
   How can I compute things like a clustering score on the DStream?
   StreamingKMeansModel has a computeCost function, but it takes an RDD.
   Maybe using DStream.foreachRDD can work (see the sketch after this
   list), but I am not able to figure out the details. How can we compute
   the cost of clustering for an unbounded stream of data? Is there an
   idiomatic way to handle this?
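
On question 3, the best I have come up with is to call computeCost
against the latest model inside foreachRDD, so the cost is computed per
microbatch rather than over the unbounded stream (untested sketch,
reusing points and model from above):

  points.foreachRDD { (rdd, time) =>
    if (!rdd.isEmpty()) {
      // computeCost takes an RDD, so score each microbatch against the
      // centers learned so far
      val cost = model.latestModel().computeCost(rdd)
      println(s"batch $time: within-cluster sum of squares = $cost")
    }
  }

For question 2, I suppose one could run several StreamingKMeans
instances with different values of k over the same DStream and compare
their per-batch costs, but that multiplies the work per microbatch; I
don't know if there is a more idiomatic way.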

Or is StreamingKMeans not the right choice for anomaly detection in an
online setting?
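
For completeness, this is how I imagined flagging outliers once the
model has seen enough data: score each point by its squared distance to
the nearest center and apply a threshold (the threshold value here is a
placeholder I would have to tune):

  points.foreachRDD { rdd =>
    val current = model.latestModel()
    val threshold = 1000.0  // placeholder; would need tuning/estimation
    rdd.filter { p =>
      // squared distance from the point to its nearest cluster center
      val nearest = current.clusterCenters(current.predict(p))
      Vectors.sqdist(p, nearest) > threshold
    }.foreach(p => println(s"possible outlier: $p"))  // goes to executor logs
  }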

Any suggestion will be welcome.

regards.

-- 
Debasish Ghosh
http://manning.com/ghosh2
http://manning.com/ghosh

Twttr: @debasishg
Blog: http://debasishg.blogspot.com
Code: http://github.com/debasishg