Posted to users@kafka.apache.org by Jon Yeargers <jo...@cedexis.com> on 2017/03/23 18:51:25 UTC

APPLICATION_SERVER_CONFIG ?

What does this config param do?

I see it referenced / used in some samples and here (
https://cwiki.apache.org/confluence/display/KAFKA/KIP-67%3A+Queryable+state+for+Kafka+Streams
)

Re: APPLICATION_SERVER_CONFIG ?

Posted by Michael Noll <mi...@confluent.io>.

Yes, agreed -- rethinking pre-existing notions is a big part of such
conversations.  A bit like making the mental switch from object-oriented
programming to functional programming -- and, just like in this case,
neither is more "right" than the other.  Personal
opinion/preference/context matters a lot, hence I tried to carefully phrase
my answer in a way that it doesn't come across as potentially
indoctrinating. ;-)

Re: APPLICATION_SERVER_CONFIG ?

Posted by Jon Yeargers <jo...@cedexis.com>.

You make some great cases for your architecture. To be clear - I've been
proselytizing for Kafka since I joined this company last year. I think my
largest issue is rethinking some preexisting notions about streaming to
make them work in the kstream universe.

Re: APPLICATION_SERVER_CONFIG ?

Posted by Michael Noll <mi...@confluent.io>.

> If I understand this correctly: assuming I have a simple aggregator
> distributed across n-docker instances each instance will _also_ need to
> support some sort of communications process for allowing access to its
> statestore (last param from KStream.groupby.aggregate).

Yes.

See
http://docs.confluent.io/current/streams/developer-guide.html#your-application-and-interactive-queries
.
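
The extra communication step can be sketched roughly as follows. This is a
deliberately simplified stand-in, not Kafka Streams code: in a real
application the key-to-instance lookup comes from the metadata that Kafka
Streams gathers from each instance's application.server setting, whereas
here it is simulated with a plain hash over a fixed host list. The class
name, the host names, and the "forward-to" placeholder are all hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model of one app instance answering interactive queries.
final class QueryRoutingSketch {
    private final String self;             // this instance's host:port
    private final List<String> instances;  // host:port of every instance
    // Stands in for the instance's local state store.
    private final Map<String, Long> localStore = new HashMap<>();

    QueryRoutingSketch(String self, List<String> instances) {
        this.self = self;
        this.instances = instances;
    }

    // Stand-in for the metadata lookup: which instance holds this key?
    // (Assumption: a plain hash, not Kafka's actual partitioning.)
    String ownerOf(String key) {
        return instances.get(Math.floorMod(key.hashCode(), instances.size()));
    }

    void putLocal(String key, long value) {
        localStore.put(key, value);
    }

    // Answer from local state if this instance owns the key;
    // otherwise report which instance the request must be forwarded to.
    String lookup(String key) {
        String owner = ownerOf(key);
        if (owner.equals(self)) {
            return "local:" + localStore.get(key);
        }
        // A real application would issue an RPC (e.g. HTTP) to the owner here.
        return "forward-to:" + owner;
    }

    public static void main(String[] args) {
        List<String> all =
            java.util.Arrays.asList("host-a:7070", "host-b:7070", "host-c:7070");
        QueryRoutingSketch node = new QueryRoutingSketch("host-a:7070", all);
        String key = "user-42";
        if (node.ownerOf(key).equals("host-a:7070")) {
            node.putLocal(key, 17L);
        }
        System.out.println(node.lookup(key));
    }
}
```

The point of the sketch is only the two-step shape: first find out who owns
the key, then either read locally or ask the owner over the network.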

> - The tombstoning facilities of redis or C* would lend themselves well to
> implementing a 'true' rolling aggregation

What is a 'true' rolling aggregation, and how could Redis or C* help with
that in a way that Kafka can't?  (Honest question.)

> I get that RocksDB has a small footprint but given the choice of
> implementing my own RPC / gossip-like process for data sharing and using a
> well tested one (ala C* or redis) I would almost always opt for the latter.
> [...]
> Just my $0.02. I would love to hear why I'm missing the 'big picture'. The
> kstreams architecture seems rife with potential.

One question is, for example:  can the remote/central DB of your choice
(Redis, C*) handle as many qps as Kafka/Kafka Streams/RocksDB can handle?
Over the network?  At the same low latency?  Also, what happens if the
remote DB is unavailable?  Do you wait and retry?  Discard?  Accept the
fact that your app's processing latency will now go through the roof?  I
wrote about some such scenarios at
https://www.confluent.io/blog/distributed-real-time-joins-and-aggregations-on-user-activity-events-using-kafka-streams/
.

One big advantage (for many use cases, not for all) with Kafka/Kafka Streams is
that you can leverage fault-tolerant *local* state that may also be
distributed across app instances.  Local state is much more efficient and
faster when doing stateful processing such as joins or aggregations.  You
don't need to worry about an external system, whether it's up and running,
whether its version is still compatible with your app, whether it can scale
as much as your app/Kafka Streams/Kafka/the volume of your input data.

Also, note that some users have actually opted to run hybrid setups:  Some
processing output is sent to a remote data store like Cassandra (e.g. via
Kafka Connect), some processing output is exposed directly through
interactive queries.  It's not like you're forced to pick only one approach.

> - Typical microservices would separate storing / retrieving data

I'd rather argue that for microservices you'd oftentimes prefer to *not*
use a remote DB, and rather do whatever the microservice needs to do inside
the microservice itself (perhaps we could relax this to "do everything in a
way that your microservice is in full, exclusive control", i.e. it doesn't
necessarily need to be *inside*, but arguably it would be better if it
actually is).
See e.g. the article
https://www.confluent.io/blog/data-dichotomy-rethinking-the-way-we-treat-data-and-services/
that lists some of the reasoning behind this school of thinking.  Again,
YMMV.

Personally, I think there's no simple true/false here.  The decisions
depend on what you need, what your context is, etc.  Anyways, since you
already have some opinions for the one side, I wanted to share some food
for thought for the other side of the argument. :-)

Best,
Michael

Re: APPLICATION_SERVER_CONFIG ?

Posted by Jon Yeargers <jo...@cedexis.com>.

If I understand this correctly: assuming I have a simple aggregator
distributed across n Docker instances, each instance will _also_ need to
support some sort of communications process for allowing access to its
state store (last param from KStream.groupby.aggregate).

How would one go about substituting a separate DB (e.g. Redis) for the
state store?

Some advantages to decoupling:
- It would seem like having a centralized process like this would alleviate
the need to execute multiple requests for a given kv pair (i.e. "who has this
data?" and subsequent requests to retrieve it).
- It would take some pressure off of each node to maintain a large disk
store.
- Typical microservices would separate storing / retrieving data.
- It would raise some eyebrows if a spec called for a MySQL/NoSQL instance
to be installed with every Docker container.
- The tombstoning facilities of Redis or C* would lend themselves well to
implementing a 'true' rolling aggregation.

I get that RocksDB has a small footprint, but given the choice between
implementing my own RPC / gossip-like process for data sharing and using a
well-tested one (à la C* or Redis), I would almost always opt for the latter.
(Footnote: Our implementations already heavily use Redis/memcached for
deduplication of Kafka messages, so it would seem a small step to use the
same to store aggregation results.)

Just my $0.02. I would love to hear why I'm missing the 'big picture'. The
Kafka Streams architecture seems rife with potential.

Re: APPLICATION_SERVER_CONFIG ?

Posted by "Matthias J. Sax" <ma...@confluent.io>.

The config does not "do" anything. It's metadata that gets broadcast
to other Streams instances for the interactive queries (IQ) feature.

See this blog post for more details:
https://www.confluent.io/blog/unifying-stream-processing-and-interactive-queries-in-apache-kafka/
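
As a rough illustration, here is a minimal sketch of how an instance might
advertise its endpoint. It uses plain java.util.Properties with the string
keys that StreamsConfig.APPLICATION_ID_CONFIG, BOOTSTRAP_SERVERS_CONFIG and
APPLICATION_SERVER_CONFIG resolve to, so it runs without Kafka on the
classpath. The application id, broker address, host name, and port are
made-up values. The setting is a host:port pair; Kafka Streams does not
open that port itself, so the application has to run its own RPC layer
there.

```java
import java.util.Properties;

// Sketch only: builds the configuration an instance would hand to KafkaStreams.
final class ApplicationServerSketch {

    // String forms of StreamsConfig.APPLICATION_ID_CONFIG,
    // BOOTSTRAP_SERVERS_CONFIG and APPLICATION_SERVER_CONFIG.
    static final String APPLICATION_ID = "application.id";
    static final String BOOTSTRAP_SERVERS = "bootstrap.servers";
    static final String APPLICATION_SERVER = "application.server";

    static Properties buildConfig(String host, int port) {
        Properties props = new Properties();
        props.setProperty(APPLICATION_ID, "rolling-aggregator");  // hypothetical app id
        props.setProperty(BOOTSTRAP_SERVERS, "localhost:9092");   // hypothetical broker
        // Metadata only: the host:port under which THIS instance can be reached
        // by the other instances of the same application.
        props.setProperty(APPLICATION_SERVER, host + ":" + port);
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical host name and port for one instance.
        Properties props = buildConfig("instance-1.example.internal", 7070);
        System.out.println(props.getProperty(APPLICATION_SERVER));
    }
}
```

Each instance would pass its own host and port, so that the others can tell
where a given piece of state lives.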

Happy to answer any follow-up questions.


-Matthias
