Posted to user@spark.apache.org by Debasish Ghosh <gh...@gmail.com> on 2016/11/17 14:03:10 UTC
outlier detection using StreamingKMeans
Hello -
I am trying to implement an outlier detection application on streaming
data. I am a newbie to Spark, so I would appreciate some advice on a few
points that are confusing me.
I am thinking of using StreamingKMeans - is this a good choice? I have a
single stream of data and I need an online algorithm. But here are some
questions that immediately come to mind:
1. I cannot do separate training, cross-validation, etc. Is it a good
idea to do training and prediction online on the same stream?
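To make question 1 concrete, this is roughly the setup I have in mind - a minimal sketch only, where `ssc` (the StreamingContext), the Kafka-backed `lines` DStream, and the comma-separated feature format are all assumptions on my part:

```scala
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors

// assumed: each Kafka record is a comma-separated feature vector
val points = lines.map(l => Vectors.dense(l.split(',').map(_.toDouble)))

val model = new StreamingKMeans()
  .setK(5)                      // k picked arbitrarily here - see question 2
  .setDecayFactor(1.0)          // 1.0 weights all past batches equally
  .setRandomCenters(3, 0.0)     // dim = 3 assumed for illustration

model.trainOn(points)                     // centers updated every microbatch
val assignments = model.predictOn(points) // cluster index per point
```

So training and prediction would both happen online on the same DStream, with no held-out data.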
2. The data will be read from a Kafka stream in microbatches of (say) 3
seconds. I get a DStream on which I train and obtain the clusters. How
can I decide on the number of clusters? With StreamingKMeans, is there
any way to iterate over microbatches with different values of k to find
the optimal one?
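The only approach I can think of for question 2 (not a built-in feature, just my own sketch) is to train several StreamingKMeans models in parallel, one per candidate k, and compare their per-batch costs - assuming `points` is the parsed DStream[Vector] and the candidate values are arbitrary:

```scala
import org.apache.spark.mllib.clustering.StreamingKMeans

// one model per candidate k, all trained on the same stream
val candidates = Seq(3, 5, 8).map { k =>
  val m = new StreamingKMeans().setK(k).setRandomCenters(3, 0.0)
  m.trainOn(points)
  k -> m
}

points.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    candidates.foreach { case (k, m) =>
      println(s"k=$k cost=${m.latestModel().computeCost(rdd)}")
    }
  }
}
```

Is something like this reasonable, or is there a better way?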
3. Even if I fix k, after training on every microbatch I get a DStream.
How can I compute things like a clustering score on the DStream?
StreamingKMeansModel has a computeCost function, but it takes an RDD.
Maybe using DStream.foreachRDD { //.. can work, but I am not able to
figure out how. How can we compute the cost of clustering for an
unbounded stream of data? Is there an idiomatic way to handle this?
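This is the kind of thing I was imagining for question 3 - a sketch, assuming a trained StreamingKMeans named `model`, the parsed `points` DStream, and my own choice of an exponential moving average (with an arbitrary `alpha`) to keep the score bounded over an unbounded stream:

```scala
var emaCost = 0.0
val alpha = 0.1   // smoothing factor, chosen arbitrarily

points.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // computeCost is an action, so this runs on the driver per batch;
    // normalize by count so batch size does not dominate
    val batchCost = model.latestModel().computeCost(rdd) / rdd.count()
    emaCost = alpha * batchCost + (1 - alpha) * emaCost
    println(s"batch cost/point = $batchCost, running EMA = $emaCost")
  }
}
```

I am not sure if keeping driver-side mutable state like this is idiomatic, though.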
Or is StreamingKMeans not the right choice for anomaly detection in an
online setting?
Any suggestions will be welcome.
regards.
--
Debasish Ghosh
http://manning.com/ghosh2
http://manning.com/ghosh
Twttr: @debasishg
Blog: http://debasishg.blogspot.com
Code: http://github.com/debasishg