Posted to dev@streams.apache.org by Matt Franklin <m....@gmail.com> on 2016/10/11 20:31:18 UTC

Re: [DISCUSS] What to do with streams-runtime-local and other streams-runtimes modules

On Tue, Sep 27, 2016 at 6:05 PM sblackmon <sb...@apache.org> wrote:

> All,
>
>
>
> Joey brought this up over the weekend and I think a discussion is overdue
> on the topic.
>
>
>
> Streams components were meant to be compatible with other runtime
> frameworks all along, and for the most part are implemented in a manner
> compatible with distributed execution, where coordination, message passing,
> and lifecycle are handled outside of the streams libraries.  By community
> standards, any component or component configuration object that isn't
> cleanly serializable for relocation in a distributed framework is a bug.
>

Agreed, though this could be more explicit.


>
>
>
> When the streams project got started in 2012, Storm was the only TLP
> real-time data processing framework at Apache, but now there are plenty of
> good choices, all of which are faster and better tested than our
> streams-runtime-local module.
>
>

>
> So, what should be the role of streams-runtime-local?  Should we keep it
> at all?  The tests take forever to run and my organization has stopped
> using it entirely.  The best argument for keeping it is that it is useful
> when integration testing small pipelines, but perhaps we could just agree
> to use something else for that purpose?
>
>
I think having a local runtime for testing or small streams is valuable,
but there is a ton of work that needs to go into the current runtime.


>
>
> Do we want to keep the other runtime modules around and continue adding
> more?  I’ve found that when embedding streams components in other
> frameworks (Spark and Flink most recently) I end up creating a handful of
> classes to help bind streams interfaces and instances within the udfs /
> functions / transforms / whatever that framework’s atomic unit of
> computation is, and reusing them in all my pipelines.
>
>
I think this is valuable.  A set of libraries that adapt a common
programming model to various frameworks and simplify stream development is
inherently cool. Write once, run anywhere.
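
As a sketch, one of those binding classes for Flink might look like the
following.  It assumes the streams-core StreamsProcessor lifecycle
(prepare / process / cleanUp) and a processor that is Serializable; an
illustration, not a proposed module:

  import java.util.List;

  import org.apache.flink.api.common.functions.RichFlatMapFunction;
  import org.apache.flink.configuration.Configuration;
  import org.apache.flink.util.Collector;
  import org.apache.streams.core.StreamsDatum;
  import org.apache.streams.core.StreamsProcessor;

  /**
   * Adapts a Serializable StreamsProcessor to Flink's atomic unit of
   * computation.  Flink serializes the function, ships it to each task
   * slot, and calls open() once per parallel instance, which is where
   * the streams lifecycle hooks in.
   */
  public class StreamsProcessorFlatMap
      extends RichFlatMapFunction<StreamsDatum, StreamsDatum> {

    private final StreamsProcessor processor;

    public StreamsProcessorFlatMap(StreamsProcessor processor) {
      this.processor = processor;
    }

    @Override
    public void open(Configuration parameters) {
      processor.prepare(null);  // lifecycle is handled by the framework
    }

    @Override
    public void flatMap(StreamsDatum datum, Collector<StreamsDatum> out) {
      List<StreamsDatum> results = processor.process(datum);
      if (results != null) {
        for (StreamsDatum result : results) {
          out.collect(result);
        }
      }
    }

    @Override
    public void close() {
      processor.cleanUp();
    }
  }

The same handful of lines, with a Spark FlatMapFunction instead, covers the
Spark case; the serializability contract above is the only real requirement.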


>
>
> How about the StreamBuilder interface?  Does anyone still believe we
> should support (and still want to work on) classes
> implementing StreamBuilder to build and run a pipeline composed solely
> of streams components on other frameworks?  Personally I prefer to write
> code using the framework APIs at the pipeline level, and embed individual
> streams components at the step level.
>
>
I think this could be valuable if done better.  For instance, binding
classes to steps in the stream pipeline, rather than instances.  This would
let the aforementioned adapter libraries configure components using the
programming model declared by streams and set up pipelines in target
systems.
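
To make that concrete, a class-binding builder might read something like
this.  Every name here is made up for illustration; the method names echo
the existing StreamBuilder, but no such class-based variant exists today:

  // Hypothetical: bind classes plus configuration to named steps, so an
  // adapter library can instantiate and wire components inside the target
  // framework instead of shipping live instances around.
  StreamBuilder builder = new FlinkStreamBuilder(flinkConfig);
  builder.newReadCurrentStream("provider",
      TwitterTimelineProvider.class, providerConfig);
  builder.addStreamsProcessor("converter",
      ActivityConverterProcessor.class, 2, "provider");
  builder.addStreamsPersistWriter("writer",
      ElasticsearchPersistWriter.class, 1, "converter");
  builder.start();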


>
>
> Any other thoughts on the topic?
>
>
>
> Steve
>
>
>
> - What should the focus be? If you look at the code, the project really
> provides 3 things: (1) a stream processing engine and integration with data
> persistence mechanisms, (2) a reference implementation of ActivityStreams,
> AS schemas, and tools for interlinking activity objects and events, and (3)
> a uniform API for integrating with social network APIs. I don't think that
> first thing is needed anymore. Just looking at Apache projects, NiFi, Apex
> + Apex Malhar, and to some extent Flume are further along here. StreamSets
> covers some of this too, and arguably Logstash also gets used for this sort
> of work. I.e., I think the project would be much stronger if it focused on
> (2) and (3) and marrying those up to other Apache projects that fit (1).
> Minimally, it needs to be disentangled a bit.

Re: [DISCUSS] What to do with streams-runtime-local and other streams-runtimes modules

Posted by Matt Franklin <m....@gmail.com>.
On Wed, Oct 12, 2016 at 2:55 PM Ryan Ebanks <ry...@gmail.com> wrote:

> It’s been a long time since I contributed, but I would like to agree that
> the local runtime should be trashed/re-written, preferably in an Akka
> implementation that is lock free, or as lock free as possible.  The current
> one is not very good, and I say that knowing I wrote a lot of it.  I also
> think it should be designed for local testing only and not for production
> use.  There are too many other frameworks to use to justify the work needed
> to fix/re-write it to a production standard.
>

Agreed, this wouldn't be for production use cases.  The biggest issue with
making it reactive is the forced polling of providers.  If we change the
core programming model, it would make a lot of things much simpler.
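
To make the polling problem concrete: today a runtime has to sit in a loop
draining the provider, roughly like the first fragment below (a
simplification of the streams-core StreamsProvider contract), whereas a
push-based contract like the second, made-up interface is what a reactive
runtime actually wants:

  // Today: the runtime polls the provider until it reports completion.
  // (downstream stands in for whatever consumes the datums.)
  provider.startStream();
  while (provider.isRunning()) {
    StreamsResultSet batch = provider.readCurrent();
    for (StreamsDatum datum : batch) {
      downstream.process(datum);
    }
  }

  // Hypothetical push-based alternative: the provider drives the flow,
  // which is what an Akka-style reactive runtime would subscribe to.
  public interface PushStreamsProvider {
    void start(java.util.function.Consumer<StreamsDatum> downstream);
    void stop();
  }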


Re: [DISCUSS] What to do with streams-runtime-local and other streams-runtimes modules

Posted by sblackmon <sb...@apache.org>.
Preferably in an Akka implementation that is lock free or as lock free as possible.

Flink has a local environment (based on Akka) that I’ve found to work very well for integration testing of Flink data pipelines, and which I think could also be suitable for testing and running streams component pipelines.

A few nice things about Flink’s architecture vis-a-vis Apache Streams:
* Sub-second spin-up and tear-down of the local environment
* Near-real-time propagation of data between stages (not micro-batch)
* Exactly-once data guarantees from many sources, including Kafka and HDFS, via checkpointing
* Ability to create a snapshot (checkpoint) on demand and restart the pipeline, preserving the state of each stage
* Most data in-flight is kept in binary blocks that spill to disk, so no out-of-memory crashes even when memory is constrained
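
For example, an integration test against the local environment only takes a
few lines.  A sketch, assuming the Flink streaming API; MyProcessor and the
StreamsProcessorFlatMap adapter are placeholders for whatever wrapping we
settle on:

  import org.apache.flink.streaming.api.datastream.DataStream;
  import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
  import org.apache.streams.core.StreamsDatum;

  public class PipelineIT {
    public static void main(String[] args) throws Exception {
      // Spins up an Akka-backed mini cluster in-process; nothing to install.
      StreamExecutionEnvironment env =
          StreamExecutionEnvironment.createLocalEnvironment();

      DataStream<StreamsDatum> input =
          env.fromElements(new StreamsDatum("{\"verb\":\"post\"}"));

      // Wrap a streams component in a Flink function and run it end to end.
      input.flatMap(new StreamsProcessorFlatMap(new MyProcessor()))
           .print();

      env.execute("streams-component-pipeline-test");
    }
  }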


Re: [DISCUSS] What to do with streams-runtime-local and other streams-runtimes modules

Posted by Ryan Ebanks <ry...@gmail.com>.
It’s been a long time since I contributed, but I would like to agree that
the local runtime should be trashed/re-written, preferably in an Akka
implementation that is lock free, or as lock free as possible.  The current
one is not very good, and I say that knowing I wrote a lot of it.  I also
think it should be designed for local testing only and not for production
use.  There are too many other frameworks to use to justify the work needed
to fix/re-write it to a production standard.


Re: [DISCUSS] What to do with streams-runtime-local and other streams-runtimes modules

Posted by sblackmon <sb...@apache.org>.
On October 11, 2016 at 3:31:41 PM, Matt Franklin (m.ben.franklin@gmail.com) wrote:
On Tue, Sep 27, 2016 at 6:05 PM sblackmon <sb...@apache.org> wrote:  

> All,  
>  
>  
>  
> Joey brought this up over the weekend and I think a discussion is overdue  
> on the topic.  
>  
>  
>  
> Streams components were meant to be compatible with other runtime
> frameworks all along, and for the most part are implemented in a manner
> compatible with distributed execution, where coordination, message passing,
> and lifecycle are handled outside of the streams libraries. By community
> standards, any component or component configuration object that isn't
> cleanly serializable for relocation in a distributed framework is a bug.
>  

Agreed, though this could be more explicit.  

Some modules contain a unit test that checks for serializability of components.  Maybe we can find a way to systematize this such that every Provider, Processor, and Persister added to the code base gets run through a serializability check during mvn test.  We could try to catch up by adding similar tests throughout the code base and -1 new submissions that don’t include such a test, but that approach seems harder than doing something using Reflections.
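
Something like the following, perhaps: a sketch only, assuming the
streams-core interfaces and the org.reflections library; a per-instance
round trip would still need a way to construct each component:

  import static org.junit.Assert.assertTrue;

  import java.io.ByteArrayInputStream;
  import java.io.ByteArrayOutputStream;
  import java.io.ObjectInputStream;
  import java.io.ObjectOutputStream;
  import java.io.Serializable;
  import java.util.Set;

  import org.apache.streams.core.StreamsProcessor;
  import org.junit.Test;
  import org.reflections.Reflections;

  public class SerializabilityTest {

    /** Every processor on the classpath must declare Serializable. */
    @Test
    public void processorsDeclareSerializable() {
      Reflections reflections = new Reflections("org.apache.streams");
      Set<Class<? extends StreamsProcessor>> processors =
          reflections.getSubTypesOf(StreamsProcessor.class);
      for (Class<?> clazz : processors) {
        assertTrue(clazz.getName() + " is not Serializable",
            Serializable.class.isAssignableFrom(clazz));
      }
    }

    /** Round-trip helper for spot-checking a configured instance. */
    @SuppressWarnings("unchecked")
    public static <T extends Serializable> T roundTrip(T instance) throws Exception {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(instance);
      }
      try (ObjectInputStream in =
          new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
        return (T) in.readObject();
      }
    }
  }

The same scan would repeat for the Provider and Persister interfaces, and
could run once during mvn test instead of per-submission tests.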


>  
>  
>  
> When the streams project got started in 2012, Storm was the only TLP
> real-time data processing framework at Apache, but now there are plenty of
> good choices, all of which are faster and better tested than our
> streams-runtime-local module.  
>  
>  

>  
> So, what should be the role of streams-runtime-local? Should we keep it  
> at all? The tests take forever to run and my organization has stopped  
> using it entirely. The best argument for keeping it is that it is useful  
> when integration testing small pipelines, but perhaps we could just agree  
> to use something else for that purpose?  
>  
>  
I think having a local runtime for testing or small streams is valuable,  
but there is a ton of work that needs to go into the current runtime.  

Yeah, the magnitude of that effort is why it might be worth considering starting from scratch.  At a minimum we need a quality testing-harness runtime; local is suitable for that, barely.


>  
>  
> Do we want to keep the other runtime modules around and continue adding
> more? I’ve found that when embedding streams components in other
> frameworks (Spark and Flink most recently) I end up creating a handful of
> classes to help bind streams interfaces and instances within the udfs /
> functions / transforms / whatever that framework’s atomic unit of
> computation is, and reusing them in all my pipelines.
>  
>  
I think this is valuable. A set of libraries that adapt a common
programming model to various frameworks and simplify stream development is
inherently cool. Write once, run anywhere.

It’s a cool idea, but I’ve never successfully used it that way.  Also, as soon as you bring a batch framework like Pig, Spark, or Flink into your design, streams persisters quickly become irrelevant, because performance is usually better with the framework’s preferred libraries.  With streaming frameworks it’s not as pronounced, but there’s a trade-off to consider at every integration point, and I’ve found pretty much universally that the framework-maintained libraries tend to be faster.


>  
>  
> How about the StreamBuilder interface? Does anyone still believe we
> should support (and still want to work on) classes
> implementing StreamBuilder to build and run a pipeline composed solely
> of streams components on other frameworks? Personally I prefer to write
> code using the framework APIs at the pipeline level, and embed individual
> streams components at the step level.
>  
>  
I think this could be valuable if done better. For instance, binding
classes to steps in the stream pipeline, rather than instances. This would
let the aforementioned adapter libraries configure components using the
programming model declared by streams and set up pipelines in target
systems.

It’s a cool idea, but I think down that path we’d wind up with pretty beefy runtimes, loss of clean separation between modules, and unwanted transitive dependencies.  For example, an HDFS persist reader embedded in a Spark pipeline should be interpreted as sc.textFile / sc.sequenceFile, or else the spark.* properties that determine read behavior won’t be picked up.  An Elasticsearch persist writer embedded in a Flink pipeline should be interpreted to use org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSink, or maybe org.apache.flink.streaming.connectors.elasticsearch2.ElasticsearchSink, depending on what’s available on the class path. Pretty soon each runtime becomes a monolith that is crazy hard to test, because it has to be in order to properly optimize how it interprets each component.  That’s my fear anyway, and it’s why I’m leaning more toward runtimes that don’t have a StreamBuilder at all, and just provide configuration support and static helper methods.
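
Concretely, by configuration support and static helper methods I mean
something like this instead of a StreamBuilder.  A hypothetical sketch: the
helper and the getter names on HdfsReaderConfiguration (from the
streams-persist-hdfs module, as I remember its schema) are illustrative,
not an existing API:

  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.JavaSparkContext;

  // Configuration class from streams-persist-hdfs (name from memory).
  import org.apache.streams.hdfs.HdfsReaderConfiguration;

  /**
   * No StreamBuilder: just a static helper that interprets a streams HDFS
   * reader configuration as a native Spark read, so spark.* properties and
   * data locality still apply.
   */
  public class SparkHdfsHelper {

    public static JavaRDD<String> readLines(
        JavaSparkContext sc, HdfsReaderConfiguration config) {
      // Build the path from the streams configuration object...
      String path = "hdfs://" + config.getHost() + ":" + config.getPort()
          + "/" + config.getPath() + "/" + config.getReaderPath();
      // ...then delegate to Spark rather than to a streams persist reader.
      return sc.textFile(path);
    }
  }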


> 
> 
> Any other thoughts on the topic? 
> 
> 
> 
> Steve 
> 
> 
> 
> - What should the focus be? If you look at the code, the project really 
> provides 3 things: (1) a stream processing engine and integration with data 
> persistence mechanisms, (2) a reference implementation of ActivityStreams, 
> AS schemas, and tools for interlinking activity objects and events, and (3) 
> a uniform API for integrating with social network APIs. I don't think that 
> first thing is needed anymore. Just looking at Apache projects, NiFi, Apex
> + Apex Malhar, and to some extent Flume are further along here. StreamSets
> covers some of this too, and arguably Logstash also gets used for this sort
> of work. I.e., I think the project would be much stronger if it focused on
> (2) and (3) and marrying those up to other Apache projects that fit (1).
> Minimally, it needs to be disentangled a bit.