Posted to user@spark.apache.org by Nasir Khan <na...@gmail.com> on 2014/03/17 01:56:43 UTC

Machine Learning on streaming data

Hi, I'm working on a project in which I have to take streaming URLs, filter
them, and classify each one as benign or suspicious. As far as I know,
machine learning and streaming are two separate things in Apache Spark. My
question is: can we apply online machine learning algorithms to streams?

I am at a beginner level, so kindly explain in a bit of detail. If someone
can direct me to some good material, that would be great.

Thanks
Nasir Khan.  



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Machine-Learning-on-streaming-data-tp2732.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Machine Learning on streaming data

Posted by Pascal Voitot Dev <pa...@gmail.com>.
Hi,

I tried a few things on that in my last blog post:
http://mandubian.com/2014/03/10/zpark-ml-nio-3/
(the last part of a triptych about Spark & scalaz-stream)

I built a collaborative filtering model and then used it on each RDD of the
DStream using a transform { rdd => model.predict(rdd)... }.
It works, but I need to investigate what happens when the model is
potentially remoted... Not sure whether it's good or not.

Pascal


On Thu, Mar 20, 2014 at 2:03 AM, Tathagata Das
<ta...@gmail.com> wrote:

> Yes, of course you can conceptually apply machine learning algorithms to
> Spark Streaming. However, the current MLlib does not yet have direct
> support for Spark Streaming's DStream. Still, since DStreams are
> essentially a sequence of RDDs, you can apply MLlib algorithms on those
> RDDs. Take a look at the DStream.transform() and DStream.foreachRDD()
> operations, which allow you to access the RDDs in a DStream. You can apply
> MLlib functions on them.
>
> Some people have attempted to make a tighter integration between MLlib and
> Spark Streaming. Jeremy (cc'ed) can say more about his adventures.
>
> TD

Re: Machine Learning on streaming data

Posted by Jeremy Freeman <fr...@janelia.hhmi.org>.
Thanks TD, happy to share my experience with MLlib + Spark Streaming integration.

Here's a gist with two examples I have working, one for StreamingLinearRegression and another for StreamingKMeans.

https://gist.github.com/freeman-lab/9672685

The goal in each case was to implement a streaming version of the algorithm, using as much as possible directly from MLlib. For linear regression this was straightforward, because the MLlib version already uses a (stochastic) update rule, which I just use to update the model inside a foreachRDD(), using each new batch of data. For KMeans, I used the model class from MLlib, but extended it to keep a running count for each cluster. I also had to re-implement a chunk of the core algorithm in the form of an update rule. Tighter integration in this case would, I think, require refactoring some of MLlib (e.g. to use something like this update function), but this works fine.
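To make the "running count for each cluster" idea concrete, here is a minimal sketch in plain Python (no Spark, no MLlib). It is one common form of such an update rule, a count-weighted running mean, and is not necessarily what the gist does. The class name RunningKMeans and its methods are made up for illustration.

```python
from typing import List, Sequence

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class RunningKMeans:
    """Hypothetical streaming k-means: centers plus a running count per cluster."""

    def __init__(self, initial_centers: List[List[float]]) -> None:
        self.centers = [list(c) for c in initial_centers]
        self.counts = [0] * len(initial_centers)  # points absorbed so far

    def predict(self, point: Point) -> int:
        """Index of the nearest center."""
        return min(range(len(self.centers)),
                   key=lambda k: squared_distance(point, self.centers[k]))

    def update(self, batch: List[Point]) -> None:
        """Fold one batch into the model using a count-weighted mean."""
        dim = len(self.centers[0])
        sums = [[0.0] * dim for _ in self.centers]
        sizes = [0] * len(self.centers)
        # Assign every point of the batch to its nearest current center.
        for point in batch:
            k = self.predict(point)
            sizes[k] += 1
            for j, x in enumerate(point):
                sums[k][j] += x
        # new_center = (old_center * n + batch_sum) / (n + m)
        for k, m in enumerate(sizes):
            if m == 0:
                continue  # no new points for this cluster
            n = self.counts[k]
            self.centers[k] = [(c * n + s) / (n + m)
                               for c, s in zip(self.centers[k], sums[k])]
            self.counts[k] = n + m


# Each list below stands in for the contents of one batch (one RDD).
model = RunningKMeans([[0.0, 0.0], [10.0, 10.0]])
model.update([(1.0, 1.0), (9.0, 9.0), (11.0, 11.0)])
model.update([(3.0, 3.0)])
```

Because the counts start at zero, the initial centers only decide the first assignment and carry no weight afterwards. In a Spark job the assignment and per-cluster sums would be computed on the RDD, and only the small per-cluster totals would be merged into the model on the driver.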

One unresolved issue: for these kinds of algorithms, the dimensionality of the data must be known in advance. It would be nice to detect it automatically from the first record.
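One possible way to handle this, sketched in plain Python rather than MLlib: leave the weights unset until the first non-empty batch arrives, then size them from the first record. The class LazyLinearModel is hypothetical, and the update is a single mini-batch gradient step on squared error, a simplification of what a real SGD implementation does (no step-size decay, no regularization, no intercept).

```python
from typing import List, Optional, Sequence, Tuple

Example = Tuple[float, Sequence[float]]  # (label, features)


class LazyLinearModel:
    """Hypothetical linear model whose dimensionality is inferred from data."""

    def __init__(self, step_size: float = 0.1) -> None:
        self.step_size = step_size
        self.weights: Optional[List[float]] = None  # unknown until data arrives

    def predict(self, features: Sequence[float]) -> float:
        if self.weights is None:
            raise ValueError("model has not seen any data yet")
        return sum(w * x for w, x in zip(self.weights, features))

    def update(self, batch: List[Example]) -> None:
        """Apply one averaged gradient step using the records in this batch."""
        if not batch:
            return  # empty batches are normal in a stream
        if self.weights is None:
            # Detect the dimensionality from the first record ever seen.
            self.weights = [0.0] * len(batch[0][1])
        dim = len(self.weights)
        gradient = [0.0] * dim
        for label, features in batch:
            if len(features) != dim:
                raise ValueError(f"expected {dim} features, got {len(features)}")
            error = self.predict(features) - label
            for j, x in enumerate(features):
                gradient[j] += error * x
        self.weights = [w - self.step_size * g / len(batch)
                        for w, g in zip(self.weights, gradient)]


model = LazyLinearModel(step_size=0.5)
model.update([])  # nothing happens, dimensionality still unknown
model.update([(2.0, [1.0, 0.0]), (4.0, [0.0, 1.0])])
```

The explicit length check matters once the dimensionality is fixed: a later record with a different number of features would otherwise be silently truncated by zip.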

-- Jeremy



Re: Machine Learning on streaming data

Posted by Tathagata Das <ta...@gmail.com>.
Yes, of course you can conceptually apply machine learning algorithms to
Spark Streaming. However, the current MLlib does not yet have direct
support for Spark Streaming's DStream. Still, since DStreams are
essentially a sequence of RDDs, you can apply MLlib algorithms on those
RDDs. Take a look at the DStream.transform() and DStream.foreachRDD()
operations, which allow you to access the RDDs in a DStream. You can apply
MLlib functions on them.
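
As a rough mental model of the paragraph above, the following plain-Python sketch treats a stream as a sequence of batches, where each list stands in for one RDD. It does not use the Spark API: transform, foreach_batch and toy_predict are hypothetical stand-ins, and the "model" is a deliberately naive rule, not a real classifier.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Batch = List[str]  # stands in for one RDD of URLs


def transform(stream: Iterable[Batch],
              fn: Callable[[Batch], list]) -> Iterator[list]:
    """Map a per-batch function over the stream, yielding a new stream."""
    for batch in stream:
        yield fn(batch)


def foreach_batch(stream: Iterable[Batch],
                  fn: Callable[[Batch], None]) -> None:
    """Run a side-effecting function on every batch (an output operation)."""
    for batch in stream:
        fn(batch)


def toy_predict(batch: Batch) -> List[Tuple[str, str]]:
    # Hypothetical stand-in for model.predict: flag URLs that are not https.
    return [(url, "benign" if url.startswith("https://") else "suspicious")
            for url in batch]


batches = [
    ["https://spark.apache.org", "http://example.test/login"],
    ["https://example.org"],
]

# transform-style use: turn a stream of URLs into a stream of predictions.
labelled = list(transform(batches, toy_predict))

# foreachRDD-style use: do something with each batch, e.g. update a model.
seen = []
foreach_batch(batches, lambda batch: seen.append(len(batch)))
```

The split mirrors the two operations named above: the transform style produces another stream (useful for scoring with an existing model), while the foreachRDD style is where side effects such as retraining or writing results would go.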

Some people have attempted to make a tighter integration between MLlib and
Spark Streaming. Jeremy (cc'ed) can say more about his adventures.

TD

