You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@flink.apache.org by Márton Balassi <mb...@ilab.sztaki.hu> on 2014/07/02 14:05:32 UTC

Fwd: Adding the streaming project to the main repository

To extend the functionality of Flink a separate branch of development was
dedicated for low latency, distributed stream processing support. The
development started during March of 2014 and is approaching a state where
it might be considered a candidate for becoming part of the main repository.

As of today a WordCount
<https://github.com/stratosphere/stratosphere-streaming/blob/master/src/main/java/eu/stratosphere/streaming/examples/wordcount/WordCountLocal.java#L30-41>
example streaming program would fairly similar to the one that the batch
API provides:

StreamExecutionEnvironment env = new StreamExecutionEnvironment();

DataStream<Tuple2<String, Integer>> dataStream =

  env.readTextFile("src/test/resources/testdata/hamlet.txt")
                                                         .flatMap(new
WordCountSplitter())
                                                         .partitionBy(0)
                                                         .map(new
WordCountCounter());

dataStream.print();

env.execute();

The user defined functions are extending the same classes as in the batch
case (e.g. a FlatMapFunction for a flatmap, see WordCountSplitter
<https://github.com/stratosphere/stratosphere-streaming/blob/master/src/main/java/eu/stratosphere/streaming/examples/wordcount/WordCountSplitter.java>)
thus providing code interusability between the two approaches.

As for performance the 0.1 version
<https://github.com/stratosphere/stratosphere-streaming/tree/release-0.1>
released in the beginning of June was slightly better on a single core then
Apache Storm, one of the major players of the field. Cluster performance
needs further optimization. This version provided a lower level API, fairly
similar to the one Storm has. For a deeper dive on this state of the
development and the challenges faced please refer to the slides
<http://info.ilab.sztaki.hu/~mbalassi/dw_forum_2014/dw_forum_streaming.pdf>
of a talk form the early days of June.

The 0.2 release is coming soon with the the above demonstrated new API and
improved single core performance. To complete the release the cluster
performance is being measured, and the code is being decomposed into three
subprojects separating core, example and addon functionality.

As for the future fault tolerance is an unresolved issue and as a part of
the Google Summer of Code project an intern is working on iterative stream
processing.

The project is mainly developed at Budapest by three members employed by
Hungarian Academy of Sciences and Eötvös Loránd University and Frank Wu,
our Google Summer of Code student from Singapore. This summer the Hungarian
Academy of Sciences also dedicated 4 interns to the project.

The proposed 0.2 release is still dependant on the 0.5 release of
Stratosphere, however on branch snapshot-0.6
<https://github.com/stratosphere/stratosphere-streaming/tree/snapshot-0.6> the
dependencies are updated to 0.6-snapshot, thus the codebase is ready for
becoming part of the main project - preferably a part of addons until it
becomes stable.

Looking forward to your suggestions.

Cheers,

Márton, Gyula & Gábor

Re: Adding the streaming project to the main repository

Posted by Gyula Fóra <gy...@gmail.com>.

The utilites that we used for performance measurements have no direct
connections to this project. We thought it would make sense to move them
out into a separate repo since we are constantly modifying the settings for
the actual tests.

On Mon, Jul 7, 2014 at 2:30 PM, Ufuk Celebi <u....@fu-berlin.de> wrote:

>
> On 07 Jul 2014, at 12:06, Márton Balassi <ba...@gmail.com> wrote:
>
> > Yeah, this might be slightly confusing - for clarifying the situation:
> >
> >
> >   - Right under the streaming-addons one can find basic connectors for
> >   message queue services - at the moment Kafka and RabbitMQ. We
> considered
> >   this "classical" addon functionality.
> >   - Additionally the job used for performance measurements is also under
> >   addons, but I'm removing it.
> >   - As usually addons mean surplus dependencies I am for separating them.
> >   Would you suggest another name then? Streaming-connectors e.g.?
>
> I also like connectors.
>
> Why are you removing the performance measurements stuff?

Re: Adding the streaming project to the main repository

Posted by Ufuk Celebi <u....@fu-berlin.de>.

On 07 Jul 2014, at 12:06, Márton Balassi <ba...@gmail.com> wrote:

> Yeah, this might be slightly confusing - for clarifying the situation:
> 
> 
>   - Right under the streaming-addons one can find basic connectors for
>   message queue services - at the moment Kafka and RabbitMQ. We considered
>   this "classical" addon functionality.
>   - Additionally the job used for performance measurements is also under
>   addons, but I'm removing it.
>   - As usually addons mean surplus dependencies I am for separating them.
>   Would you suggest another name then? Streaming-connectors e.g.?

I also like connectors.

Why are you removing the performance measurements stuff?

Re: Adding the streaming project to the main repository

Posted by Stephan Ewen <se...@apache.org>.

I like the name "connectors"

Re: Adding the streaming project to the main repository

Posted by Márton Balassi <ba...@gmail.com>.

Yeah, this might be slightly confusing - for clarifying the situation:


   - Right under the streaming-addons one can find basic connectors for
   message queue services - at the moment Kafka and RabbitMQ. We considered
   this "classical" addon functionality.
   - Additionally the job used for performance measurements is also under
   addons, but I'm removing it.
   - As usually addons mean surplus dependencies I am for separating them.
   Would you suggest another name then? Streaming-connectors e.g.?



On Mon, Jul 7, 2014 at 11:48 AM, Stephan Ewen <se...@apache.org> wrote:

> Hi!
>
> Thanks for the update!
>
> From y side, +1 for adding the code.
>
> One question though: What part of the code is in your "addons" project? I
> am wondering if that may cause confusion, because (as per the discussion
> via hangout last week), we want to add the streaming code initially to the
> "flink-addons" project.
>
> Stephan
>

Re: Adding the streaming project to the main repository

Posted by Stephan Ewen <se...@apache.org>.

Hi!

Thanks for the update!

>From y side, +1 for adding the code.

One question though: What part of the code is in your "addons" project? I
am wondering if that may cause confusion, because (as per the discussion
via hangout last week), we want to add the streaming code initially to the
"flink-addons" project.

Stephan

Re: Adding the streaming project to the main repository

Posted by Gyula Fóra <gy...@gmail.com>.

Hey,

I would like to give a quick update on the status of the flink streaming
project; all of our dependencies are now updated to the current
0.6-snapshot in our main branch, and the project is now decomposed into 3
subprojects: core, examples, and addons.

We have created a separate branch for our 0.2 release with dependencies to
0.5, however from now on we will focus our development efforts to be able
to merge our main branch with the Main Flink project.

Regards,

Gyula, Márton & Gábor


On Wed, Jul 2, 2014 at 2:05 PM, Márton Balassi <mb...@ilab.sztaki.hu>
wrote:

> To extend the functionality of Flink a separate branch of development was
> dedicated for low latency, distributed stream processing support. The
> development started during March of 2014 and is approaching a state where
> it might be considered a candidate for becoming part of the main
> repository.
>
> As of today a WordCount
> <
> https://github.com/stratosphere/stratosphere-streaming/blob/master/src/main/java/eu/stratosphere/streaming/examples/wordcount/WordCountLocal.java#L30-41
> >
> example streaming program would fairly similar to the one that the batch
> API provides:
>
> StreamExecutionEnvironment env = new StreamExecutionEnvironment();
>
> DataStream<Tuple2<String, Integer>> dataStream =
>
>   env.readTextFile("src/test/resources/testdata/hamlet.txt")
>                                                          .flatMap(new
> WordCountSplitter())
>                                                          .partitionBy(0)
>                                                          .map(new
> WordCountCounter());
>
> dataStream.print();
>
> env.execute();
>
> The user defined functions are extending the same classes as in the batch
> case (e.g. a FlatMapFunction for a flatmap, see WordCountSplitter
> <
> https://github.com/stratosphere/stratosphere-streaming/blob/master/src/main/java/eu/stratosphere/streaming/examples/wordcount/WordCountSplitter.java
> >)
> thus providing code interusability between the two approaches.
>
> As for performance the 0.1 version
> <https://github.com/stratosphere/stratosphere-streaming/tree/release-0.1>
> released in the beginning of June was slightly better on a single core then
> Apache Storm, one of the major players of the field. Cluster performance
> needs further optimization. This version provided a lower level API, fairly
> similar to the one Storm has. For a deeper dive on this state of the
> development and the challenges faced please refer to the slides
> <http://info.ilab.sztaki.hu/~mbalassi/dw_forum_2014/dw_forum_streaming.pdf
> >
> of a talk form the early days of June.
>
> The 0.2 release is coming soon with the the above demonstrated new API and
> improved single core performance. To complete the release the cluster
> performance is being measured, and the code is being decomposed into three
> subprojects separating core, example and addon functionality.
>
> As for the future fault tolerance is an unresolved issue and as a part of
> the Google Summer of Code project an intern is working on iterative stream
> processing.
>
> The project is mainly developed at Budapest by three members employed by
> Hungarian Academy of Sciences and Eötvös Loránd University and Frank Wu,
> our Google Summer of Code student from Singapore. This summer the Hungarian
> Academy of Sciences also dedicated 4 interns to the project.
>
> The proposed 0.2 release is still dependant on the 0.5 release of
> Stratosphere, however on branch snapshot-0.6
> <https://github.com/stratosphere/stratosphere-streaming/tree/snapshot-0.6>
> the
> dependencies are updated to 0.6-snapshot, thus the codebase is ready for
> becoming part of the main project - preferably a part of addons until it
> becomes stable.
>
> Looking forward to your suggestions.
>
> Cheers,
>
> Márton, Gyula & Gábor
>