You are viewing a plain text version of this content. The canonical link for it is here.
Posted to users@kafka.apache.org by Davood Rafiei <ra...@gmail.com> on 2016/07/26 17:42:07 UTC

Kafka Streams multi-node

Hi,

I am newbie in Kafka and Kafka-Streams. I read documentation to get
information how it works in multi-node environment. As a result I want to
run streams library on cluster that consists of more than one node.
From what I understood, I try to resolve the following conflicts:
- Streams is a standalone java application.So it runs in a single node, of
n-node cluster of kafka.
- However, streams runs on top of kafka, and if we set a multi-broker kafka
cluster, and then run streams library from master node, then streams
library will run in entire cluster.

So, streams library is standalone java application but to force it to run
in multiple nodes, do we need to do something extra (in configuration for
example) if we have already kafka running in multi-broker mode?


Thanks
Davood

Re: Kafka Streams multi-node

Posted by "Matthias J. Sax" <ma...@confluent.io>.
See inline.

-Matthias


On 07/26/2016 11:10 PM, Davood Rafiei wrote:
> Thanks David and Matthias for reply. To make sure that I understand
> correctly:
> - Each stream application is limited to only one node. In that node all
> stream execution DAG is processed.

An application is not limited to a single node; however, an application
*instance* is. I guess you understood it correctly anyway (just want to
make sure, we use the same terminology)

> - If we want to parallelize our application, we start new instance of
> streams application, which will be similar to previous one (it will run all
> DAG operators inside of that node) and after adding new application,  it
> gets separate partition to process (the one that no other stream
> application is processing).

Yes.

> - There is no topology to break the DAG and operators (of stream
> application) into separate nodes (of cluster) and make it look like more
> "dala flow"ish. Instead we have "little" end-to-end instances running on
> each cluster node.

Yes. Just want to point out, that this is similar to other data flow
systems (eg, Flink).

Furthermore, you could manually accomplish the split, by splitting your
application into multiple applications (where the output of your first
application part serves as input for the second and so on...). However,
I personally do not see a big advantage in splitting the application.

> - Assume we run some aggregate operator with time windows of finance data
> stream. Then we can have only one partition and only one streams
> application, as increasing partitions and/or streams applications can cause
> problems to the logic of particular use case.

Yes. Kafka Streams parallelism is limited by the number of partitions of
your input topics. If you can partition your data depends on your use
case -- and if the use case does not allow data parallel processing you
are in a bad position with regard to scaling

(This is a general limitation and independent of the tool you are using...)

> 
> Please correct me if I am wrong.
> 
> Thanks
> Davood
> 
> On Tue, Jul 26, 2016 at 9:44 PM, Matthias J. Sax <ma...@confluent.io>
> wrote:
> 
>> David's answer is correct. Just start the same application multiple
>> times on different nodes and the library does the rest for you.
>>
>> Just one addition: as Kafka Streams is for standard application
>> development, there is no need to run the application on the same nodes
>> as your brokers are running (ie, applications instances could run on any
>> machine outside of your broker cluster).
>>
>> Of course, it is possible to run the application on broker nodes. Just
>> wanted to point out, that there is no co-location required between
>> brokers and app instances.
>>
>> -Matthias
>>
>>
>> On 07/26/2016 08:08 PM, David Garcia wrote:
>>>
>> http://docs.confluent.io/3.0.0/streams/architecture.html#parallelism-model
>>>
>>> you shouldn’t have to do anything.  Simply starting a new thread will
>> “rebalance” your streaming job.  The job coordinates with tasks through
>> kafka itself.
>>>
>>>
>>>
>>> On 7/26/16, 12:42 PM, "Davood Rafiei" <ra...@gmail.com> wrote:
>>>
>>>     Hi,
>>>
>>>     I am newbie in Kafka and Kafka-Streams. I read documentation to get
>>>     information how it works in multi-node environment. As a result I
>> want to
>>>     run streams library on cluster that consists of more than one node.
>>>     From what I understood, I try to resolve the following conflicts:
>>>     - Streams is a standalone java application.So it runs in a single
>> node, of
>>>     n-node cluster of kafka.
>>>     - However, streams runs on top of kafka, and if we set a
>> multi-broker kafka
>>>     cluster, and then run streams library from master node, then streams
>>>     library will run in entire cluster.
>>>
>>>     So, streams library is standalone java application but to force it
>> to run
>>>     in multiple nodes, do we need to do something extra (in
>> configuration for
>>>     example) if we have already kafka running in multi-broker mode?
>>>
>>>
>>>     Thanks
>>>     Davood
>>>
>>>
>>
>>
> 


Re: Kafka Streams multi-node

Posted by Davood Rafiei <ra...@gmail.com>.
Thanks David and Matthias for reply. To make sure that I understand
correctly:
- Each stream application is limited to only one node. In that node all
stream execution DAG is processed.
- If we want to parallelize our application, we start new instance of
streams application, which will be similar to previous one (it will run all
DAG operators inside of that node) and after adding new application,  it
gets separate partition to process (the one that no other stream
application is processing).
- There is no topology to break the DAG and operators (of stream
application) into separate nodes (of cluster) and make it look like more
"dala flow"ish. Instead we have "little" end-to-end instances running on
each cluster node.
- Assume we run some aggregate operator with time windows of finance data
stream. Then we can have only one partition and only one streams
application, as increasing partitions and/or streams applications can cause
problems to the logic of particular use case.

Please correct me if I am wrong.

Thanks
Davood

On Tue, Jul 26, 2016 at 9:44 PM, Matthias J. Sax <ma...@confluent.io>
wrote:

> David's answer is correct. Just start the same application multiple
> times on different nodes and the library does the rest for you.
>
> Just one addition: as Kafka Streams is for standard application
> development, there is no need to run the application on the same nodes
> as your brokers are running (ie, applications instances could run on any
> machine outside of your broker cluster).
>
> Of course, it is possible to run the application on broker nodes. Just
> wanted to point out, that there is no co-location required between
> brokers and app instances.
>
> -Matthias
>
>
> On 07/26/2016 08:08 PM, David Garcia wrote:
> >
> http://docs.confluent.io/3.0.0/streams/architecture.html#parallelism-model
> >
> > you shouldn’t have to do anything.  Simply starting a new thread will
> “rebalance” your streaming job.  The job coordinates with tasks through
> kafka itself.
> >
> >
> >
> > On 7/26/16, 12:42 PM, "Davood Rafiei" <ra...@gmail.com> wrote:
> >
> >     Hi,
> >
> >     I am newbie in Kafka and Kafka-Streams. I read documentation to get
> >     information how it works in multi-node environment. As a result I
> want to
> >     run streams library on cluster that consists of more than one node.
> >     From what I understood, I try to resolve the following conflicts:
> >     - Streams is a standalone java application.So it runs in a single
> node, of
> >     n-node cluster of kafka.
> >     - However, streams runs on top of kafka, and if we set a
> multi-broker kafka
> >     cluster, and then run streams library from master node, then streams
> >     library will run in entire cluster.
> >
> >     So, streams library is standalone java application but to force it
> to run
> >     in multiple nodes, do we need to do something extra (in
> configuration for
> >     example) if we have already kafka running in multi-broker mode?
> >
> >
> >     Thanks
> >     Davood
> >
> >
>
>

Re: Kafka Streams multi-node

Posted by "Matthias J. Sax" <ma...@confluent.io>.
David's answer is correct. Just start the same application multiple
times on different nodes and the library does the rest for you.

Just one addition: as Kafka Streams is for standard application
development, there is no need to run the application on the same nodes
as your brokers are running (ie, applications instances could run on any
machine outside of your broker cluster).

Of course, it is possible to run the application on broker nodes. Just
wanted to point out, that there is no co-location required between
brokers and app instances.

-Matthias


On 07/26/2016 08:08 PM, David Garcia wrote:
> http://docs.confluent.io/3.0.0/streams/architecture.html#parallelism-model
> 
> you shouldn’t have to do anything.  Simply starting a new thread will “rebalance” your streaming job.  The job coordinates with tasks through kafka itself.
> 
> 
> 
> On 7/26/16, 12:42 PM, "Davood Rafiei" <ra...@gmail.com> wrote:
> 
>     Hi,
>     
>     I am newbie in Kafka and Kafka-Streams. I read documentation to get
>     information how it works in multi-node environment. As a result I want to
>     run streams library on cluster that consists of more than one node.
>     From what I understood, I try to resolve the following conflicts:
>     - Streams is a standalone java application.So it runs in a single node, of
>     n-node cluster of kafka.
>     - However, streams runs on top of kafka, and if we set a multi-broker kafka
>     cluster, and then run streams library from master node, then streams
>     library will run in entire cluster.
>     
>     So, streams library is standalone java application but to force it to run
>     in multiple nodes, do we need to do something extra (in configuration for
>     example) if we have already kafka running in multi-broker mode?
>     
>     
>     Thanks
>     Davood
>     
> 


Re: Kafka Streams multi-node

Posted by David Garcia <da...@spiceworks.com>.
http://docs.confluent.io/3.0.0/streams/architecture.html#parallelism-model

you shouldn’t have to do anything.  Simply starting a new thread will “rebalance” your streaming job.  The job coordinates with tasks through kafka itself.



On 7/26/16, 12:42 PM, "Davood Rafiei" <ra...@gmail.com> wrote:

    Hi,
    
    I am newbie in Kafka and Kafka-Streams. I read documentation to get
    information how it works in multi-node environment. As a result I want to
    run streams library on cluster that consists of more than one node.
    From what I understood, I try to resolve the following conflicts:
    - Streams is a standalone java application.So it runs in a single node, of
    n-node cluster of kafka.
    - However, streams runs on top of kafka, and if we set a multi-broker kafka
    cluster, and then run streams library from master node, then streams
    library will run in entire cluster.
    
    So, streams library is standalone java application but to force it to run
    in multiple nodes, do we need to do something extra (in configuration for
    example) if we have already kafka running in multi-broker mode?
    
    
    Thanks
    Davood