You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@spark.apache.org by debasishg <gh...@gmail.com> on 2016/11/19 16:46:41 UTC

using StreamingKMeans

Hello -

I am trying to implement an outlier detection application on streaming data.
I am a newbie to Spark and hence would like some advice on the confusions
that I have ..

I am thinking of using StreamingKMeans - is this a good choice ? I have one
stream of data and I need an online algorithm. But here are some questions
that immediately come to my mind ..

1. I cannot do separate training, cross validation etc. Is this a good idea
to do training and prediction online ?

2. The data will be read from the stream coming from Kafka in microbatches
of (say) 3 seconds. I get a DStream on which I train and get the clusters.
How can I decide on the number of clusters ? Using StreamingKMeans is there
any way I can iterate on microbatches with different values of k to find the
optimal one ?

3. Even if I fix k, after training on every microbatch I get a DStream. How
can I compute things like clustering score on the DStream ?
StreamingKMeansModel has a computeCost function but it takes an RDD. I can
use dstream.foreachRDD { // process RDD for the micro batch here } - is this
the idiomatic way ?

4. If I use dstream.foreachRDD { .. } and use functions like new
StandardScaler().fit(rdd) to do feature normalization, then it works when I
have data in the stream. But when the microbatch is empty (say I don't have
data for some time), the fit method throws exception as it gets an empty
collection. Things start working ok when data starts coming back to the
stream. But is this the way to go ?

any suggestion will be welcome ..

regards.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/using-StreamingKMeans-tp28109.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

Re: using StreamingKMeans

Posted by Julian Keppel <ju...@gmail.com>.

I do research in anomaly detection with methods of machine learning at the
moment. And currently I do kmeans clustering, too in an offline learning
setting. In further work we want to compare the two paradigms of offline
and online learning. I would like to share some thoughts on this
disscussion.

My offline setting is exactly what Guha Ayan explained: We collect data for
training and test over few days/weeks/month and train the model
periodically, lets say once a week for example. Please note that kmeans is
unsupervised, so it doesn't have any idea of what you data is about and
what could be "normal" or an anomaly. So in my opinion the training dataset
has to represent a state in which everything occors, the normal datapoints
and also as any anomalies. Referring to "Anomaly Detection: A Survey" from
Varun Chandola et. al. from 2009 there are different methods of
interpreting the results than. As an example: "Normal data instances belong
to large and dense clusters, while anomalies either belong to small or
sparse clusters". So potential anomalies have to be present in you
trainingdataset, I think.

The online learning setting is meant to adapt rapid changes in you
environment. So for example, if you are analyzing network traffinc, and you
add a new service which produces a lot of traffic (a lot of users use the
new service), than in an offline setting where you learn just once a week,
your new service may produce a false alarm whereas the online model would
adapt these changes (depending on the configured forgetfullness). There are
use cases where you have a very dynamic environment (for example flight
ticket prices), where you need to adapt you model rapidly (see for example
here: https://youtu.be/wyfTjd9z1sY).

2016-11-20 2:11 GMT+01:00 Debasish Ghosh <gh...@gmail.com>:

> I share both the concerns that u have expressed. And as I mentioned in my
> earlier mail, offline (batch) training is an option if I get a dataset
> without outliers. In that case I can train and have a model. I find the
> model parameters, which will be the mean distance to the centroid. Note in
> training I will have only 1 cluster as it's only normal data (no outlier).
>
> I can now pass these parameters to the prediction phase which can work on
> streaming data. In the prediction phase I just compute the distance to
> centroid for each point and flag the violating ones as outliers.
>
> This looks like a perfectly valid option if I get a dataset with no
> outliers to train on.
>
> Now my question is what then is the use case in which we can use
> StreamingKMeans ? In the above scenario we use batch KMeans in training
> phase while we just compute the distance in the prediction phase. And how
> do we address the scenario where we have only one stream of data available ?
>
> regards.
>
> On Sun, 20 Nov 2016 at 6:07 AM, ayan guha <gu...@gmail.com> wrote:
>
>> Here are 2 concerns I would have with the design (This discussion is
>> mostly to validate my own understanding)
>>
>> 1. if you have outliers "before" running k-means, aren't your centroids
>> get skewed? In other word, outliers by themselves may bias the cluster
>> evaluation, isn't it?
>> 2. Typically microbatches are small, like 3 sec in your case. in this
>> window you may not have enough data to run any statistically sigficant
>> operation, can you?
>>
>> My approach would have been: Run K-means on data without outliers (in
>> batch mode). Determine the model, ie centroids in case of kmeans. Then load
>> the model in your streaming app and just apply "outlier detection"
>> function, which takes the form of
>>
>> def detectOutlier(model,data):
>>       /// your code, like mean distance etc
>>       return T or F
>>
>> In response to your point about "alternet set of data", I would assume
>> you would accumulate the data you are receiving from streaming over few
>> weeks or months before running offline training.
>>
>> Am I missing something?
>>
>> On Sun, Nov 20, 2016 at 10:29 AM, Debasish Ghosh <
>> ghosh.debasish@gmail.com> wrote:
>>
>> Looking for alternative suggestions in case where we have 1 continuous
>> stream of data. Offline training and online prediction can be one option if
>> we can have an alternate set of data to train. But if it's one single
>> stream you don't have separate sets for training or cross validation.
>>
>> So whatever data u get in each micro batch, train on them and u get the
>> cluster centroids from the model. Then apply some heuristics like mean
>> distance from centroid and detect outliers. So for every microbatch u get
>> the outliers based on the model and u can control forgetfulness of the
>> model through the decay factor that u specify for StramingKMeans.
>>
>> Suggestions ?
>>
>> regards.
>>
>> On Sun, 20 Nov 2016 at 3:51 AM, ayan guha <gu...@gmail.com> wrote:
>>
>> Curious why do you want to train your models every 3 secs?
>> On 20 Nov 2016 06:25, "Debasish Ghosh" <gh...@gmail.com> wrote:
>>
>> Thanks a lot for the response.
>>
>> Regarding the sampling part - yeah that's what I need to do if there's no
>> way of titrating the number of clusters online.
>>
>> I am using something like
>>
>> dstream.foreachRDD { rdd =>
>>   if (rdd.count() > 0) { //.. logic
>>   }
>> }
>>
>> Feels a little odd but if that's the idiom then I will stick to it.
>>
>> regards.
>>
>>
>>
>> On Sat, Nov 19, 2016 at 10:52 PM, Cody Koeninger <co...@koeninger.org>
>> wrote:
>>
>> So I haven't played around with streaming k means at all, but given
>> that no one responded to your message a couple of days ago, I'll say
>> what I can.
>>
>> 1. Can you not sample out some % of the stream for training?
>> 2. Can you run multiple streams at the same time with different values
>> for k and compare their performance?
>> 3. foreachRDD is fine in general, can't speak to the specifics.
>> 4. If you haven't done any transformations yet on a direct stream,
>> foreachRDD will give you a KafkaRDD.  Checking if a KafkaRDD is empty
>> is very cheap, it's done on the driver only because the beginning and
>> ending offsets are known.  So you should be able to skip empty
>> batches.
>>
>>
>>
>> On Sat, Nov 19, 2016 at 10:46 AM, debasishg <gh...@gmail.com>
>> wrote:
>> > Hello -
>> >
>> > I am trying to implement an outlier detection application on streaming
>> data.
>> > I am a newbie to Spark and hence would like some advice on the
>> confusions
>> > that I have ..
>> >
>> > I am thinking of using StreamingKMeans - is this a good choice ? I have
>> one
>> > stream of data and I need an online algorithm. But here are some
>> questions
>> > that immediately come to my mind ..
>> >
>> > 1. I cannot do separate training, cross validation etc. Is this a good
>> idea
>> > to do training and prediction online ?
>> >
>> > 2. The data will be read from the stream coming from Kafka in
>> microbatches
>> > of (say) 3 seconds. I get a DStream on which I train and get the
>> clusters.
>> > How can I decide on the number of clusters ? Using StreamingKMeans is
>> there
>> > any way I can iterate on microbatches with different values of k to
>> find the
>> > optimal one ?
>> >
>> > 3. Even if I fix k, after training on every microbatch I get a DStream.
>> How
>> > can I compute things like clustering score on the DStream ?
>> > StreamingKMeansModel has a computeCost function but it takes an RDD. I
>> can
>> > use dstream.foreachRDD { // process RDD for the micro batch here } - is
>> this
>> > the idiomatic way ?
>> >
>> > 4. If I use dstream.foreachRDD { .. } and use functions like new
>> > StandardScaler().fit(rdd) to do feature normalization, then it works
>> when I
>> > have data in the stream. But when the microbatch is empty (say I don't
>> have
>> > data for some time), the fit method throws exception as it gets an empty
>> > collection. Things start working ok when data starts coming back to the
>> > stream. But is this the way to go ?
>> >
>> > any suggestion will be welcome ..
>> >
>> > regards.
>> >
>> >
>> >
>> > --
>> > View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/using-StreamingKMeans-tp28109.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>> >
>>
>>
>>
>>
>> --
>> Debasish Ghosh
>> http://manning.com/ghosh2
>> http://manning.com/ghosh
>>
>> Twttr: @debasishg
>> Blog: http://debasishg.blogspot.com
>> Code: http://github.com/debasishg
>>
>> --
>> Sent from my iPhone
>>
>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
> --
> Sent from my iPhone
>

Re: using StreamingKMeans

Posted by Debasish Ghosh <gh...@gmail.com>.

I share both the concerns that u have expressed. And as I mentioned in my
earlier mail, offline (batch) training is an option if I get a dataset
without outliers. In that case I can train and have a model. I find the
model parameters, which will be the mean distance to the centroid. Note in
training I will have only 1 cluster as it's only normal data (no outlier).

I can now pass these parameters to the prediction phase which can work on
streaming data. In the prediction phase I just compute the distance to
centroid for each point and flag the violating ones as outliers.

This looks like a perfectly valid option if I get a dataset with no
outliers to train on.

Now my question is what then is the use case in which we can use
StreamingKMeans ? In the above scenario we use batch KMeans in training
phase while we just compute the distance in the prediction phase. And how
do we address the scenario where we have only one stream of data available ?

regards.

On Sun, 20 Nov 2016 at 6:07 AM, ayan guha <gu...@gmail.com> wrote:

> Here are 2 concerns I would have with the design (This discussion is
> mostly to validate my own understanding)
>
> 1. if you have outliers "before" running k-means, aren't your centroids
> get skewed? In other word, outliers by themselves may bias the cluster
> evaluation, isn't it?
> 2. Typically microbatches are small, like 3 sec in your case. in this
> window you may not have enough data to run any statistically sigficant
> operation, can you?
>
> My approach would have been: Run K-means on data without outliers (in
> batch mode). Determine the model, ie centroids in case of kmeans. Then load
> the model in your streaming app and just apply "outlier detection"
> function, which takes the form of
>
> def detectOutlier(model,data):
>       /// your code, like mean distance etc
>       return T or F
>
> In response to your point about "alternet set of data", I would assume you
> would accumulate the data you are receiving from streaming over few weeks
> or months before running offline training.
>
> Am I missing something?
>
> On Sun, Nov 20, 2016 at 10:29 AM, Debasish Ghosh <ghosh.debasish@gmail.com
> > wrote:
>
> Looking for alternative suggestions in case where we have 1 continuous
> stream of data. Offline training and online prediction can be one option if
> we can have an alternate set of data to train. But if it's one single
> stream you don't have separate sets for training or cross validation.
>
> So whatever data u get in each micro batch, train on them and u get the
> cluster centroids from the model. Then apply some heuristics like mean
> distance from centroid and detect outliers. So for every microbatch u get
> the outliers based on the model and u can control forgetfulness of the
> model through the decay factor that u specify for StramingKMeans.
>
> Suggestions ?
>
> regards.
>
> On Sun, 20 Nov 2016 at 3:51 AM, ayan guha <gu...@gmail.com> wrote:
>
> Curious why do you want to train your models every 3 secs?
> On 20 Nov 2016 06:25, "Debasish Ghosh" <gh...@gmail.com> wrote:
>
> Thanks a lot for the response.
>
> Regarding the sampling part - yeah that's what I need to do if there's no
> way of titrating the number of clusters online.
>
> I am using something like
>
> dstream.foreachRDD { rdd =>
>   if (rdd.count() > 0) { //.. logic
>   }
> }
>
> Feels a little odd but if that's the idiom then I will stick to it.
>
> regards.
>
>
>
> On Sat, Nov 19, 2016 at 10:52 PM, Cody Koeninger <co...@koeninger.org>
> wrote:
>
> So I haven't played around with streaming k means at all, but given
> that no one responded to your message a couple of days ago, I'll say
> what I can.
>
> 1. Can you not sample out some % of the stream for training?
> 2. Can you run multiple streams at the same time with different values
> for k and compare their performance?
> 3. foreachRDD is fine in general, can't speak to the specifics.
> 4. If you haven't done any transformations yet on a direct stream,
> foreachRDD will give you a KafkaRDD.  Checking if a KafkaRDD is empty
> is very cheap, it's done on the driver only because the beginning and
> ending offsets are known.  So you should be able to skip empty
> batches.
>
>
>
> On Sat, Nov 19, 2016 at 10:46 AM, debasishg <gh...@gmail.com>
> wrote:
> > Hello -
> >
> > I am trying to implement an outlier detection application on streaming
> data.
> > I am a newbie to Spark and hence would like some advice on the confusions
> > that I have ..
> >
> > I am thinking of using StreamingKMeans - is this a good choice ? I have
> one
> > stream of data and I need an online algorithm. But here are some
> questions
> > that immediately come to my mind ..
> >
> > 1. I cannot do separate training, cross validation etc. Is this a good
> idea
> > to do training and prediction online ?
> >
> > 2. The data will be read from the stream coming from Kafka in
> microbatches
> > of (say) 3 seconds. I get a DStream on which I train and get the
> clusters.
> > How can I decide on the number of clusters ? Using StreamingKMeans is
> there
> > any way I can iterate on microbatches with different values of k to find
> the
> > optimal one ?
> >
> > 3. Even if I fix k, after training on every microbatch I get a DStream.
> How
> > can I compute things like clustering score on the DStream ?
> > StreamingKMeansModel has a computeCost function but it takes an RDD. I
> can
> > use dstream.foreachRDD { // process RDD for the micro batch here } - is
> this
> > the idiomatic way ?
> >
> > 4. If I use dstream.foreachRDD { .. } and use functions like new
> > StandardScaler().fit(rdd) to do feature normalization, then it works
> when I
> > have data in the stream. But when the microbatch is empty (say I don't
> have
> > data for some time), the fit method throws exception as it gets an empty
> > collection. Things start working ok when data starts coming back to the
> > stream. But is this the way to go ?
> >
> > any suggestion will be welcome ..
> >
> > regards.
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/using-StreamingKMeans-tp28109.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe e-mail: user-unsubscribe@spark.apache.org
> >
>
>
>
>
> --
> Debasish Ghosh
> http://manning.com/ghosh2
> http://manning.com/ghosh
>
> Twttr: @debasishg
> Blog: http://debasishg.blogspot.com
> Code: http://github.com/debasishg
>
> --
> Sent from my iPhone
>
>
>
>
> --
> Best Regards,
> Ayan Guha
>
-- 
Sent from my iPhone

Re: using StreamingKMeans

Posted by ayan guha <gu...@gmail.com>.

Here are 2 concerns I would have with the design (This discussion is mostly
to validate my own understanding)

1. if you have outliers "before" running k-means, aren't your centroids get
skewed? In other word, outliers by themselves may bias the cluster
evaluation, isn't it?
2. Typically microbatches are small, like 3 sec in your case. in this
window you may not have enough data to run any statistically sigficant
operation, can you?

My approach would have been: Run K-means on data without outliers (in batch
mode). Determine the model, ie centroids in case of kmeans. Then load the
model in your streaming app and just apply "outlier detection" function,
which takes the form of

def detectOutlier(model,data):
      /// your code, like mean distance etc
      return T or F

In response to your point about "alternet set of data", I would assume you
would accumulate the data you are receiving from streaming over few weeks
or months before running offline training.

Am I missing something?

On Sun, Nov 20, 2016 at 10:29 AM, Debasish Ghosh <gh...@gmail.com>
wrote:

> Looking for alternative suggestions in case where we have 1 continuous
> stream of data. Offline training and online prediction can be one option if
> we can have an alternate set of data to train. But if it's one single
> stream you don't have separate sets for training or cross validation.
>
> So whatever data u get in each micro batch, train on them and u get the
> cluster centroids from the model. Then apply some heuristics like mean
> distance from centroid and detect outliers. So for every microbatch u get
> the outliers based on the model and u can control forgetfulness of the
> model through the decay factor that u specify for StramingKMeans.
>
> Suggestions ?
>
> regards.
>
> On Sun, 20 Nov 2016 at 3:51 AM, ayan guha <gu...@gmail.com> wrote:
>
>> Curious why do you want to train your models every 3 secs?
>> On 20 Nov 2016 06:25, "Debasish Ghosh" <gh...@gmail.com> wrote:
>>
>> Thanks a lot for the response.
>>
>> Regarding the sampling part - yeah that's what I need to do if there's no
>> way of titrating the number of clusters online.
>>
>> I am using something like
>>
>> dstream.foreachRDD { rdd =>
>>   if (rdd.count() > 0) { //.. logic
>>   }
>> }
>>
>> Feels a little odd but if that's the idiom then I will stick to it.
>>
>> regards.
>>
>>
>>
>> On Sat, Nov 19, 2016 at 10:52 PM, Cody Koeninger <co...@koeninger.org>
>> wrote:
>>
>> So I haven't played around with streaming k means at all, but given
>> that no one responded to your message a couple of days ago, I'll say
>> what I can.
>>
>> 1. Can you not sample out some % of the stream for training?
>> 2. Can you run multiple streams at the same time with different values
>> for k and compare their performance?
>> 3. foreachRDD is fine in general, can't speak to the specifics.
>> 4. If you haven't done any transformations yet on a direct stream,
>> foreachRDD will give you a KafkaRDD.  Checking if a KafkaRDD is empty
>> is very cheap, it's done on the driver only because the beginning and
>> ending offsets are known.  So you should be able to skip empty
>> batches.
>>
>>
>>
>> On Sat, Nov 19, 2016 at 10:46 AM, debasishg <gh...@gmail.com>
>> wrote:
>> > Hello -
>> >
>> > I am trying to implement an outlier detection application on streaming
>> data.
>> > I am a newbie to Spark and hence would like some advice on the
>> confusions
>> > that I have ..
>> >
>> > I am thinking of using StreamingKMeans - is this a good choice ? I have
>> one
>> > stream of data and I need an online algorithm. But here are some
>> questions
>> > that immediately come to my mind ..
>> >
>> > 1. I cannot do separate training, cross validation etc. Is this a good
>> idea
>> > to do training and prediction online ?
>> >
>> > 2. The data will be read from the stream coming from Kafka in
>> microbatches
>> > of (say) 3 seconds. I get a DStream on which I train and get the
>> clusters.
>> > How can I decide on the number of clusters ? Using StreamingKMeans is
>> there
>> > any way I can iterate on microbatches with different values of k to
>> find the
>> > optimal one ?
>> >
>> > 3. Even if I fix k, after training on every microbatch I get a DStream.
>> How
>> > can I compute things like clustering score on the DStream ?
>> > StreamingKMeansModel has a computeCost function but it takes an RDD. I
>> can
>> > use dstream.foreachRDD { // process RDD for the micro batch here } - is
>> this
>> > the idiomatic way ?
>> >
>> > 4. If I use dstream.foreachRDD { .. } and use functions like new
>> > StandardScaler().fit(rdd) to do feature normalization, then it works
>> when I
>> > have data in the stream. But when the microbatch is empty (say I don't
>> have
>> > data for some time), the fit method throws exception as it gets an empty
>> > collection. Things start working ok when data starts coming back to the
>> > stream. But is this the way to go ?
>> >
>> > any suggestion will be welcome ..
>> >
>> > regards.
>> >
>> >
>> >
>> > --
>> > View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/using-StreamingKMeans-tp28109.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>> >
>>
>>
>>
>>
>> --
>> Debasish Ghosh
>> http://manning.com/ghosh2
>> http://manning.com/ghosh
>>
>> Twttr: @debasishg
>> Blog: http://debasishg.blogspot.com
>> Code: http://github.com/debasishg
>>
>> --
> Sent from my iPhone
>



-- 
Best Regards,
Ayan Guha

Re: using StreamingKMeans

Posted by Debasish Ghosh <gh...@gmail.com>.

Looking for alternative suggestions in case where we have 1 continuous
stream of data. Offline training and online prediction can be one option if
we can have an alternate set of data to train. But if it's one single
stream you don't have separate sets for training or cross validation.

So whatever data u get in each micro batch, train on them and u get the
cluster centroids from the model. Then apply some heuristics like mean
distance from centroid and detect outliers. So for every microbatch u get
the outliers based on the model and u can control forgetfulness of the
model through the decay factor that u specify for StramingKMeans.

Suggestions ?

regards.

On Sun, 20 Nov 2016 at 3:51 AM, ayan guha <gu...@gmail.com> wrote:

> Curious why do you want to train your models every 3 secs?
> On 20 Nov 2016 06:25, "Debasish Ghosh" <gh...@gmail.com> wrote:
>
> Thanks a lot for the response.
>
> Regarding the sampling part - yeah that's what I need to do if there's no
> way of titrating the number of clusters online.
>
> I am using something like
>
> dstream.foreachRDD { rdd =>
>   if (rdd.count() > 0) { //.. logic
>   }
> }
>
> Feels a little odd but if that's the idiom then I will stick to it.
>
> regards.
>
>
>
> On Sat, Nov 19, 2016 at 10:52 PM, Cody Koeninger <co...@koeninger.org>
> wrote:
>
> So I haven't played around with streaming k means at all, but given
> that no one responded to your message a couple of days ago, I'll say
> what I can.
>
> 1. Can you not sample out some % of the stream for training?
> 2. Can you run multiple streams at the same time with different values
> for k and compare their performance?
> 3. foreachRDD is fine in general, can't speak to the specifics.
> 4. If you haven't done any transformations yet on a direct stream,
> foreachRDD will give you a KafkaRDD.  Checking if a KafkaRDD is empty
> is very cheap, it's done on the driver only because the beginning and
> ending offsets are known.  So you should be able to skip empty
> batches.
>
>
>
> On Sat, Nov 19, 2016 at 10:46 AM, debasishg <gh...@gmail.com>
> wrote:
> > Hello -
> >
> > I am trying to implement an outlier detection application on streaming
> data.
> > I am a newbie to Spark and hence would like some advice on the confusions
> > that I have ..
> >
> > I am thinking of using StreamingKMeans - is this a good choice ? I have
> one
> > stream of data and I need an online algorithm. But here are some
> questions
> > that immediately come to my mind ..
> >
> > 1. I cannot do separate training, cross validation etc. Is this a good
> idea
> > to do training and prediction online ?
> >
> > 2. The data will be read from the stream coming from Kafka in
> microbatches
> > of (say) 3 seconds. I get a DStream on which I train and get the
> clusters.
> > How can I decide on the number of clusters ? Using StreamingKMeans is
> there
> > any way I can iterate on microbatches with different values of k to find
> the
> > optimal one ?
> >
> > 3. Even if I fix k, after training on every microbatch I get a DStream.
> How
> > can I compute things like clustering score on the DStream ?
> > StreamingKMeansModel has a computeCost function but it takes an RDD. I
> can
> > use dstream.foreachRDD { // process RDD for the micro batch here } - is
> this
> > the idiomatic way ?
> >
> > 4. If I use dstream.foreachRDD { .. } and use functions like new
> > StandardScaler().fit(rdd) to do feature normalization, then it works
> when I
> > have data in the stream. But when the microbatch is empty (say I don't
> have
> > data for some time), the fit method throws exception as it gets an empty
> > collection. Things start working ok when data starts coming back to the
> > stream. But is this the way to go ?
> >
> > any suggestion will be welcome ..
> >
> > regards.
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/using-StreamingKMeans-tp28109.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe e-mail: user-unsubscribe@spark.apache.org
> >
>
>
>
>
> --
> Debasish Ghosh
> http://manning.com/ghosh2
> http://manning.com/ghosh
>
> Twttr: @debasishg
> Blog: http://debasishg.blogspot.com
> Code: http://github.com/debasishg
>
> --
Sent from my iPhone

Re: using StreamingKMeans

Posted by ayan guha <gu...@gmail.com>.

Curious why do you want to train your models every 3 secs?
On 20 Nov 2016 06:25, "Debasish Ghosh" <gh...@gmail.com> wrote:

> Thanks a lot for the response.
>
> Regarding the sampling part - yeah that's what I need to do if there's no
> way of titrating the number of clusters online.
>
> I am using something like
>
> dstream.foreachRDD { rdd =>
>   if (rdd.count() > 0) { //.. logic
>   }
> }
>
> Feels a little odd but if that's the idiom then I will stick to it.
>
> regards.
>
>
>
> On Sat, Nov 19, 2016 at 10:52 PM, Cody Koeninger <co...@koeninger.org>
> wrote:
>
>> So I haven't played around with streaming k means at all, but given
>> that no one responded to your message a couple of days ago, I'll say
>> what I can.
>>
>> 1. Can you not sample out some % of the stream for training?
>> 2. Can you run multiple streams at the same time with different values
>> for k and compare their performance?
>> 3. foreachRDD is fine in general, can't speak to the specifics.
>> 4. If you haven't done any transformations yet on a direct stream,
>> foreachRDD will give you a KafkaRDD.  Checking if a KafkaRDD is empty
>> is very cheap, it's done on the driver only because the beginning and
>> ending offsets are known.  So you should be able to skip empty
>> batches.
>>
>>
>>
>> On Sat, Nov 19, 2016 at 10:46 AM, debasishg <gh...@gmail.com>
>> wrote:
>> > Hello -
>> >
>> > I am trying to implement an outlier detection application on streaming
>> data.
>> > I am a newbie to Spark and hence would like some advice on the
>> confusions
>> > that I have ..
>> >
>> > I am thinking of using StreamingKMeans - is this a good choice ? I have
>> one
>> > stream of data and I need an online algorithm. But here are some
>> questions
>> > that immediately come to my mind ..
>> >
>> > 1. I cannot do separate training, cross validation etc. Is this a good
>> idea
>> > to do training and prediction online ?
>> >
>> > 2. The data will be read from the stream coming from Kafka in
>> microbatches
>> > of (say) 3 seconds. I get a DStream on which I train and get the
>> clusters.
>> > How can I decide on the number of clusters ? Using StreamingKMeans is
>> there
>> > any way I can iterate on microbatches with different values of k to
>> find the
>> > optimal one ?
>> >
>> > 3. Even if I fix k, after training on every microbatch I get a DStream.
>> How
>> > can I compute things like clustering score on the DStream ?
>> > StreamingKMeansModel has a computeCost function but it takes an RDD. I
>> can
>> > use dstream.foreachRDD { // process RDD for the micro batch here } - is
>> this
>> > the idiomatic way ?
>> >
>> > 4. If I use dstream.foreachRDD { .. } and use functions like new
>> > StandardScaler().fit(rdd) to do feature normalization, then it works
>> when I
>> > have data in the stream. But when the microbatch is empty (say I don't
>> have
>> > data for some time), the fit method throws exception as it gets an empty
>> > collection. Things start working ok when data starts coming back to the
>> > stream. But is this the way to go ?
>> >
>> > any suggestion will be welcome ..
>> >
>> > regards.
>> >
>> >
>> >
>> > --
>> > View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/using-StreamingKMeans-tp28109.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>> >
>>
>
>
>
> --
> Debasish Ghosh
> http://manning.com/ghosh2
> http://manning.com/ghosh
>
> Twttr: @debasishg
> Blog: http://debasishg.blogspot.com
> Code: http://github.com/debasishg
>

Re: using StreamingKMeans

Posted by Debasish Ghosh <gh...@gmail.com>.

Thanks a lot for the response.

Regarding the sampling part - yeah that's what I need to do if there's no
way of titrating the number of clusters online.

I am using something like

dstream.foreachRDD { rdd =>
  if (rdd.count() > 0) { //.. logic
  }
}

Feels a little odd but if that's the idiom then I will stick to it.

regards.



On Sat, Nov 19, 2016 at 10:52 PM, Cody Koeninger <co...@koeninger.org> wrote:

> So I haven't played around with streaming k means at all, but given
> that no one responded to your message a couple of days ago, I'll say
> what I can.
>
> 1. Can you not sample out some % of the stream for training?
> 2. Can you run multiple streams at the same time with different values
> for k and compare their performance?
> 3. foreachRDD is fine in general, can't speak to the specifics.
> 4. If you haven't done any transformations yet on a direct stream,
> foreachRDD will give you a KafkaRDD.  Checking if a KafkaRDD is empty
> is very cheap, it's done on the driver only because the beginning and
> ending offsets are known.  So you should be able to skip empty
> batches.
>
>
>
> On Sat, Nov 19, 2016 at 10:46 AM, debasishg <gh...@gmail.com>
> wrote:
> > Hello -
> >
> > I am trying to implement an outlier detection application on streaming
> data.
> > I am a newbie to Spark and hence would like some advice on the confusions
> > that I have ..
> >
> > I am thinking of using StreamingKMeans - is this a good choice ? I have
> one
> > stream of data and I need an online algorithm. But here are some
> questions
> > that immediately come to my mind ..
> >
> > 1. I cannot do separate training, cross validation etc. Is this a good
> idea
> > to do training and prediction online ?
> >
> > 2. The data will be read from the stream coming from Kafka in
> microbatches
> > of (say) 3 seconds. I get a DStream on which I train and get the
> clusters.
> > How can I decide on the number of clusters ? Using StreamingKMeans is
> there
> > any way I can iterate on microbatches with different values of k to find
> the
> > optimal one ?
> >
> > 3. Even if I fix k, after training on every microbatch I get a DStream.
> How
> > can I compute things like clustering score on the DStream ?
> > StreamingKMeansModel has a computeCost function but it takes an RDD. I
> can
> > use dstream.foreachRDD { // process RDD for the micro batch here } - is
> this
> > the idiomatic way ?
> >
> > 4. If I use dstream.foreachRDD { .. } and use functions like new
> > StandardScaler().fit(rdd) to do feature normalization, then it works
> when I
> > have data in the stream. But when the microbatch is empty (say I don't
> have
> > data for some time), the fit method throws exception as it gets an empty
> > collection. Things start working ok when data starts coming back to the
> > stream. But is this the way to go ?
> >
> > any suggestion will be welcome ..
> >
> > regards.
> >
> >
> >
> > --
> > View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/using-StreamingKMeans-tp28109.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe e-mail: user-unsubscribe@spark.apache.org
> >
>



-- 
Debasish Ghosh
http://manning.com/ghosh2
http://manning.com/ghosh

Twttr: @debasishg
Blog: http://debasishg.blogspot.com
Code: http://github.com/debasishg

Re: using StreamingKMeans

Posted by Cody Koeninger <co...@koeninger.org>.

So I haven't played around with streaming k means at all, but given
that no one responded to your message a couple of days ago, I'll say
what I can.

1. Can you not sample out some % of the stream for training?
2. Can you run multiple streams at the same time with different values
for k and compare their performance?
3. foreachRDD is fine in general, can't speak to the specifics.
4. If you haven't done any transformations yet on a direct stream,
foreachRDD will give you a KafkaRDD.  Checking if a KafkaRDD is empty
is very cheap, it's done on the driver only because the beginning and
ending offsets are known.  So you should be able to skip empty
batches.



On Sat, Nov 19, 2016 at 10:46 AM, debasishg <gh...@gmail.com> wrote:
> Hello -
>
> I am trying to implement an outlier detection application on streaming data.
> I am a newbie to Spark and hence would like some advice on the confusions
> that I have ..
>
> I am thinking of using StreamingKMeans - is this a good choice ? I have one
> stream of data and I need an online algorithm. But here are some questions
> that immediately come to my mind ..
>
> 1. I cannot do separate training, cross validation etc. Is this a good idea
> to do training and prediction online ?
>
> 2. The data will be read from the stream coming from Kafka in microbatches
> of (say) 3 seconds. I get a DStream on which I train and get the clusters.
> How can I decide on the number of clusters ? Using StreamingKMeans is there
> any way I can iterate on microbatches with different values of k to find the
> optimal one ?
>
> 3. Even if I fix k, after training on every microbatch I get a DStream. How
> can I compute things like clustering score on the DStream ?
> StreamingKMeansModel has a computeCost function but it takes an RDD. I can
> use dstream.foreachRDD { // process RDD for the micro batch here } - is this
> the idiomatic way ?
>
> 4. If I use dstream.foreachRDD { .. } and use functions like new
> StandardScaler().fit(rdd) to do feature normalization, then it works when I
> have data in the stream. But when the microbatch is empty (say I don't have
> data for some time), the fit method throws exception as it gets an empty
> collection. Things start working ok when data starts coming back to the
> stream. But is this the way to go ?
>
> any suggestion will be welcome ..
>
> regards.
>
>
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/using-StreamingKMeans-tp28109.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org