
Posted to dev@samoa.apache.org by Paris Carbone <pa...@kth.se> on 2015/05/04 14:28:13 UTC

Some feedback from building SAMOA adapters

Hello everyone,

First of all, congratulations on the effort you have put into creating SAMOA and making it open source. While there is a lot of research around distributed streaming ML, there is a great lack of actual tools, frameworks and programming models that let people use such techniques and compose custom pipelines.
My main experience with SAMOA so far is at its runtime level, since I worked with Faye (CCed) on adding Apache Flink as one of the main adapters. We generally did not face any great challenges, apart from having to do manual cycle detection in order to support Flink's iterations, but we can sum up some observations here for you to consider:

* There was no documentation regarding the integration of new backend systems. We had to extract information from a thesis report and from the source code and structure of the existing adapters. A guide for backend contributors would be very useful in the future.
* If you look across all the adapters, a lot of logic is replicated for each system. That means there is room for more abstractions to generalise things such as parametrising, instantiating and deploying tasks.
* Regarding the programming model, I noticed that you sometimes use action triggers (e.g. ‘evaluate every x records’). You could abstract trigger actions so that you can reuse them in various components. Another advantage is that you could expose them downwards from tasks, so that systems like Apache Flink, Crunch or Google Dataflow can override (and optimise) some of this logic using their built-in windowing semantics. This is just a simple idea, but I think there is potential in general in abstracting and exposing as much as possible while still keeping implementation complexity to a minimum.
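
To make the trigger idea in the last point a bit more concrete, here is a minimal sketch. All names (`Trigger`, `CountTrigger`, `PeriodicEvaluator`) are hypothetical and not part of SAMOA's API; the point is only that a component asks a trigger whether to fire instead of hard-coding its own counter, and that the trigger exposes its parameter so a backend could, in principle, replace it with its own count window.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical abstraction: decides when an action should fire. Not part of SAMOA's API. */
interface Trigger<T> {
    /** Called for every incoming record; returns true when the action should fire. */
    boolean onRecord(T record);
}

/** Fires once every n records, i.e. the "evaluate every x records" case. */
final class CountTrigger<T> implements Trigger<T> {
    private final long every;
    private long seen;

    CountTrigger(long every) {
        if (every <= 0) {
            throw new IllegalArgumentException("every must be positive");
        }
        this.every = every;
    }

    @Override
    public boolean onRecord(T record) {
        seen++;
        if (seen == every) {
            seen = 0; // start counting the next batch
            return true;
        }
        return false;
    }

    /** Exposed so that a backend could map this trigger onto its own windowing. */
    long every() {
        return every;
    }
}

/** Hypothetical component that reuses a trigger instead of keeping its own counter. */
final class PeriodicEvaluator {
    private final Trigger<String> trigger;
    private final List<Integer> evaluations = new ArrayList<>();
    private int processed;

    PeriodicEvaluator(Trigger<String> trigger) {
        this.trigger = trigger;
    }

    void process(String record) {
        processed++;
        if (trigger.onRecord(record)) {
            // Stand-in for a real evaluation: remember at which record it happened.
            evaluations.add(processed);
        }
    }

    List<Integer> evaluations() {
        return evaluations;
    }
}
```

With `new PeriodicEvaluator(new CountTrigger<String>(3))` and seven processed records, the evaluator would record evaluations at records 3 and 6.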

We will keep an eye on the dev list and participate actively, sending more feedback as we use SAMOA more and discover further needs. Currently, we are working on an experimental ML pipeline prototype that also works on streams, so we will try to keep it as much in sync with SAMOA as possible.

cheers
Paris, Faye

Re: Some feedback from building SAMOA adapters

Posted by Gianmarco De Francisci Morales <gd...@apache.org>.

Hi Paris, Faye,

Thanks a lot for your effort and for contributing the adapter.
I am excited by the collaboration with the Flink community, which is very
active and which I like very much.

Indeed, most of the documentation we wrote is user-facing, and we have not
put much effort into the systems side.
The reason is that we expect more algorithms to be contributed than systems.
But definitely something to work on.

Indeed, right now the system-level API is just a very thin layer, and most
of the logic goes in the adapters.
This has the advantage that we can support quite different systems, but
indeed some of the code in the adapters is very similar.
We could probably try to factor some of it into abstract classes, but Java
doesn't allow mixins, and forcing adapters to extend our classes is a no-go.
Maybe we could have some common utility classes though.
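
As a rough sketch of what such a utility class could look like, assuming hypothetical names (`TaskArgs`, `ExampleAdapter`) and a made-up "-key value" option format that is not SAMOA's actual one: the shared logic lives in a static helper, and each adapter calls it without having to extend anything.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical shared helper; adapters call it instead of extending a base class. */
final class TaskArgs {
    private TaskArgs() {
        // static helper, not meant to be instantiated
    }

    /**
     * Parses "-key value" pairs into an ordered map. A flag without a value maps to "true".
     * Simplification: a value starting with '-' (e.g. a negative number) is read as a new option.
     */
    static Map<String, String> parse(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i++) {
            String arg = args[i];
            if (!arg.startsWith("-") || arg.length() < 2) {
                throw new IllegalArgumentException("Expected an option, got: " + arg);
            }
            String key = arg.substring(1);
            boolean hasValue = i + 1 < args.length && !args[i + 1].startsWith("-");
            options.put(key, hasValue ? args[++i] : "true");
        }
        return options;
    }
}

/** Hypothetical adapter: keeps its own class hierarchy and only delegates to the helper. */
final class ExampleAdapter {
    String describe(String[] args) {
        Map<String, String> options = TaskArgs.parse(args);
        return "deploying with " + options;
    }
}
```

The trade-off is that adapters have to remember to call the helper, but nothing constrains which classes they extend.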

The suggestion of exposing triggers is a very good one.
Right now the programming model of SAMOA is very close to message passing.
Moving towards a more DSL-like language, with higher-level semantics, would
be great.
The challenge is how to maintain flexibility and efficiency.
Triggers seem to be a very good starting point; I added them to the Roadmap.

I am very curious about the ML prototype.
Do you have any pointers to design documents or documentation (or even just
code)?

Cheers,
--
Gianmarco
