Posted to user@predictionio.apache.org by Marcin Ziemiński <zi...@gmail.com> on 2016/10/03 14:26:56 UTC

Re: Spark Streaming integration?

Hi James,

Many thanks for your suggestions!
I expect that the architecture of PredictionIO would have to change
significantly to provide such functionality.
The current design is quite rigid, so we are aiming first to modularize
the key components, which should make it possible to plug in various
mechanisms for collecting and processing data.

Regarding the points:
1,2. This is the key issue, as the present engine design is about training
models statically, as a separate offline step. It's hard for me to imagine
stream processing fitting cleanly inside engines as they are now.
3,4. This is why modularization is the key step to address first, so that
we can provide alternatives to the Event Server and add some kind of data
sinks. Data retention policies seem like another important issue, which
will require a broader discussion.
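
To make the contrast with static training a bit more concrete, here is a
minimal sketch in plain Python (not PredictionIO or Spark code) of a model
that is updated once per micro-batch, together with the two storage modes
from point 3. All names below (OnlineLinearModel, RetentionBuffer,
train_on_stream) are hypothetical and exist only for illustration. The
update rule is ordinary mini-batch gradient descent on squared error, which
is an assumption about how such a learner could work, not a description of
any existing engine.

```python
from collections import deque
from typing import Deque, Iterable, List, Tuple

Example = Tuple[List[float], float]  # (features, label)


class OnlineLinearModel:
    """Hypothetical online learner: linear regression updated per micro-batch."""

    def __init__(self, n_features: int, step_size: float = 0.1) -> None:
        self.weights = [0.0] * n_features
        self.step_size = step_size

    def predict(self, features: List[float]) -> float:
        return sum(w * x for w, x in zip(self.weights, features))

    def update(self, batch: List[Example]) -> None:
        # One gradient step on the mean squared error of this batch only;
        # earlier batches are never revisited.
        if not batch:
            return
        grad = [0.0] * len(self.weights)
        for features, label in batch:
            error = self.predict(features) - label
            for i, x in enumerate(features):
                grad[i] += error * x
        for i in range(len(self.weights)):
            self.weights[i] -= self.step_size * grad[i] / len(batch)


class RetentionBuffer:
    """Hypothetical retention policy: keep only the newest max_events events.

    max_events=0 corresponds to the "store no data" mode.
    """

    def __init__(self, max_events: int) -> None:
        self.events: Deque[Example] = deque(maxlen=max_events)

    def add(self, batch: Iterable[Example]) -> None:
        self.events.extend(batch)


def train_on_stream(
    model: OnlineLinearModel,
    batches: Iterable[List[Example]],
    buffer: RetentionBuffer,
) -> None:
    # Each incoming micro-batch updates the model immediately and is then
    # handed to the retention policy, which decides what (if anything) to keep.
    for batch in batches:
        model.update(batch)
        buffer.add(batch)
```

For example, feeding it repeated batches drawn from y = 2x moves the single
weight towards 2, while the buffer never holds more than max_events events.
In a real design the batches would come from a pluggable source (REST,
Kafka, ...) rather than from an in-memory list.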

Regards,
Marcin



On Fri, 30 Sep 2016 at 20:51, James Wu <ja...@gmail.com> wrote:

> Hi Marcin,
>
> Thanks for the reply. I've just spent a couple of days looking into
> prediction.io tutorials, so forgive me if I missed something. :)
>
> IMO there are several pieces that need to work together to make
> streaming work really well:
> 1. Online learning algorithms - MLlib provides 2 streaming algorithms
> (linear regression and k-means). Arguably more algorithms would be
> needed. This falls outside the scope of prediction.io, but it
> could mean that prediction.io needs to integrate with another platform
> that provides online learning. I've been looking at other online
> learning projects but haven't yet got a good grasp of the landscape in
> this area.
> 2. Model training - it seems the current prediction.io framework can be
> tweaked a little to make this work with MLlib streaming algorithms?
> Basically, move model training to the 'pio deploy' step, where the model
> is trained using a DStream.
> 3. Data ingestion - There are probably going to be 2 modes for online
> training: store no data, or store data under a retention policy. We
> also need a real-time ingestion mechanism other than REST (e.g. Kafka).
> 4. Prediction - the existing prediction API is relevant, but we should
> also consider proactive predictions (like flagging anomalies in data) and
> a feedback mechanism. Perhaps we need a "data sink" concept which can
> proactively generate notifications.
>
> Please let me know what you think.
>
> Thanks! James
>
> On Fri, Sep 30, 2016 at 4:01 AM, Marcin Ziemiński <zi...@gmail.com>
> wrote:
> > Hi James,
> >
> > Incorporating Spark Streaming or Structured Streaming in PredictionIO
> > will probably involve significant changes to the architecture. We are
> > currently rethinking the design so that it can enable different
> > approaches to processing data. Future releases (after 0.10) should bring
> > some changes, and I hope that introducing streaming will be one of them.
> > Do you have any thoughts on how you imagine putting stream processing
> > into PredictionIO and using it this way? Any input on this matter would
> > be of great value.
> >
> > Thanks,
> > Marcin
> >
> > On Fri, 30 Sep 2016 at 01:02, James Wu <ja...@gmail.com> wrote:
> >>
> >> Hi,
> >>
> >> Are there any plans for prediction.io to integrate with Spark
> >> Streaming and support the streaming algorithms in MLlib?
> >>
> >> Thanks! James
>