Posted to user@spark.apache.org by Mich Talebzadeh <mi...@gmail.com> on 2017/10/11 07:31:19 UTC

Kafka streams vs Spark streaming

Hi,

Has anyone had experience using Kafka Streams versus Spark Streaming?

I am not familiar with the Kafka Streams concept, except that it is a set of
libraries.

Any feedback will be appreciated.

Regards,

Mich



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.

Re: Kafka streams vs Spark streaming

Posted by Sachin Mittal <sj...@gmail.com>.

No, it won't work this way.
Say you have 9 partitions and 3 instances:
1 = {1, 2, 3}
2 = {4, 5, 6}
3 = {7, 8, 9}
And let's say a particular key (k1) is always written to partition 4.
Now say you increase the partitions to 12. You may have:
1 = {1, 2, 3, 4}
2 = {5, 6, 7, 8}
3 = {9, 10, 11, 12}

Now it is possible that k1 is still written to partition 4, which is now
processed by instance 1, or that it is redistributed to some other
partition, say 9, and processed by instance 3.
So we would have old data for that key in partition 4 and new data, written
after the partitions were added, in partition 9.

Hence an aggregation over a time window that spans both partitions would
report incorrect results.

As a result, it is general practice to create somewhat more partitions than
the anticipated data volume requires. Hence Kafka topics and Kafka Streams
are not as elastic as Spark.
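
The remapping described above can be sketched in a few lines of Python. This
is only an illustration under stated assumptions: partition_for is a
hypothetical stand-in for a hash partitioner (stable hash of the key modulo
the partition count), crc32 is used purely because it is deterministic and
ships with Python (it is not the hash Kafka uses), and the keys are made up.

```python
import zlib


def partition_for(key, num_partitions):
    """Hypothetical hash partitioner: stable hash of the key modulo partition count."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Made-up keys, purely for illustration.
keys = [f"k{i}" for i in range(1, 11)]

before = {k: partition_for(k, 9) for k in keys}   # topic with 9 partitions
after = {k: partition_for(k, 12) for k in keys}   # same topic grown to 12

# Keys whose new records now land in a different partition than their old ones.
moved = [k for k in keys if before[k] != after[k]]

for k in keys:
    print(f"{k}: partition {before[k]} -> {after[k]}")
print(f"{len(moved)} of {len(keys)} keys changed partition")
```

Any key listed in moved has its history split across two partitions, which
is exactly the case where a windowed aggregation per partition would go
wrong.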



On Wed, Oct 11, 2017 at 1:56 PM, Sabarish Sasidharan <sabarish.spk@gmail.com> wrote:

> @Sachin
> >>is not elastic. You need to anticipate before hand on volume of data you
> will have. Very difficult to add and reduce topic partitions later on.
>
> Why do you say so Sachin? Kafka Streams will readjust once we add more
> partitions to the Kafka topic. And when we add more machines, rebalancing
> auto distributes the partitions among the new stream threads.
>
> Regards
> Sab

Re: Kafka streams vs Spark streaming

Posted by Sabarish Sasidharan <sa...@gmail.com>.

@Sachin
> It is not elastic. You need to anticipate before hand on volume of data you
> will have. Very difficult to add and reduce topic partitions later on.

Why do you say so, Sachin? Kafka Streams will readjust once we add more
partitions to the Kafka topic. And when we add more machines, rebalancing
automatically distributes the partitions among the new stream threads.
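
As a rough picture of what rebalancing does, here is a minimal Python
sketch. It assumes a plain round-robin spread, which is a simplification and
not the assignor Kafka actually uses; assign and the instance names are
hypothetical.

```python
def assign(partitions, instances):
    """Spread partitions over instances round-robin (simplified, not Kafka's assignor)."""
    assignment = {name: [] for name in instances}
    for i, partition in enumerate(partitions):
        assignment[instances[i % len(instances)]].append(partition)
    return assignment


partitions = list(range(1, 10))  # 9 partitions, numbered 1..9

three = assign(partitions, ["A", "B", "C"])
four = assign(partitions, ["A", "B", "C", "D"])  # one more machine joins

# With more instances than partitions, the extra instances get nothing to do.
twelve = assign(partitions, [f"I{i}" for i in range(12)])
idle = [name for name, owned in twelve.items() if not owned]

print(three)
print(four)
print("idle instances:", idle)
```

The last case shows the limit of this kind of scaling: the partition count
caps how many instances can do useful work.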

Regards
Sab

Re: Kafka streams vs Spark streaming

Posted by Sachin Mittal <sj...@gmail.com>.

Well, it depends on the use case. Say the metric you are evaluating is
grouped by a key and you want to parallelize the operation by adding more
instances, so that each instance deals with only a particular group. Then it
is always better to have the partitioning done on that key as well. This way
a particular instance will always compute on certain partitions and hence
certain keys.

So in such a case you need to make sure the producers are also producing
based on that key.

It's optional, yes, but for good performance one needs to ensure topics are
partitioned based on key hashes.

In Spark this is not needed, as it is not backed by a topic.

In short, Kafka Streams is backed by a topic, and that does create some
downsides (alongside some upsides too).
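
To make the point about partitioning on the grouping key concrete, here is a
small Python sketch comparing key-hash routing with round-robin routing. All
names are hypothetical, crc32 again stands in for a real partitioner hash,
and the owner function assumes a fixed partition-to-instance mapping.

```python
import zlib
from collections import Counter, defaultdict

NUM_PARTITIONS = 6
INSTANCES = 3  # assumed fixed mapping: partition p is owned by instance p % 3


def keyed_partition(key):
    """Route by a stable hash of the key (stand-in for a key-hash partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def owner(partition):
    return partition % INSTANCES


def count_per_instance(records, choose_partition):
    """Each instance keeps its own local per-key counts for the partitions it owns."""
    state = defaultdict(Counter)
    for i, (key, value) in enumerate(records):
        state[owner(choose_partition(i, key))][key] += value
    return state


def instances_seeing(state, key):
    return [inst for inst, counts in state.items() if key in counts]


events = [("alice", 1), ("bob", 1), ("alice", 1),
          ("carol", 1), ("bob", 1), ("alice", 1)]

by_key = count_per_instance(events, lambda i, key: keyed_partition(key))
round_robin = count_per_instance(events, lambda i, key: i % NUM_PARTITIONS)

print("keyed:      ", instances_seeing(by_key, "alice"))
print("round robin:", instances_seeing(round_robin, "alice"))
```

With keyed routing a single instance holds the complete count for a key, so
it can aggregate locally. With round-robin routing the count for a key is
split across instances and has to be merged somewhere else.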


On Wed, Oct 11, 2017 at 2:00 PM, Sabarish Sasidharan <sabarish.spk@gmail.com> wrote:

> @Sachin
> >>The partition key is very important if you need to run multiple
> instances of streams application and certain instance processing certain
> partitions only.
>
> Again, depending on partition key is optional. It's actually a feature
> enabler, so we can use local state stores to improve throughput. I don't
> see this as a downside.
>
> Regards
> Sab

Re: Kafka streams vs Spark streaming

Posted by Sabarish Sasidharan <sa...@gmail.com>.

@Sachin
> The partition key is very important if you need to run multiple instances
> of streams application and certain instance processing certain partitions
> only.

Again, depending on the partition key is optional. It's actually a feature
enabler, so we can use local state stores to improve throughput. I don't
see this as a downside.

Regards
Sab

Re: Kafka streams vs Spark streaming

Posted by Sachin Mittal <sj...@gmail.com>.

Kafka Streams has a lower learning curve, and if your source data is in
Kafka topics it is pretty simple to integrate with.
It can run as a library inside your main program.

So compared to Spark Streaming, it:
1. Is much simpler to implement.
2. Is not as heavy on hardware as Spark.

On the downside:
1. It is not elastic. You need to anticipate beforehand the volume of data
you will have. It is very difficult to add topic partitions later on, and
Kafka does not support reducing them.
2. The partition key is very important if you need to run multiple
instances of the streams application, with each instance processing only
certain partitions.
     In case you need aggregation on a different key, you may need to
re-partition the data to a new topic and run a new streams app against that.

So yes, if you have a good idea about your data, it comes from Kafka, and
you want to build something quickly without much hardware, Kafka Streams is
a good way to go.

We had first tried Spark Streaming, but given hardware limitations and the
complexity of fetching data from MongoDB we decided on Kafka Streams as the
way forward.
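
The re-partitioning step mentioned under downside 2 can be pictured with a
short Python sketch. It is an in-memory illustration only: the records, the
country field, and the repartition helper are all hypothetical, and a dict
of lists stands in for the partitions of the new topic.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4


def partition_for(key):
    """Stand-in for a key-hash partitioner (crc32 is an assumption, not Kafka's hash)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


# Source records keyed by user id (made-up data).
source = [
    ("u1", {"country": "IN", "amount": 10}),
    ("u2", {"country": "UK", "amount": 5}),
    ("u3", {"country": "IN", "amount": 7}),
    ("u1", {"country": "IN", "amount": 3}),
    ("u4", {"country": "UK", "amount": 1}),
]


def repartition(records, new_key_fn):
    """Re-key each record and route it to the partition chosen by the new key."""
    topic = defaultdict(list)
    for _, value in records:
        new_key = new_key_fn(value)
        topic[partition_for(new_key)].append((new_key, value))
    return topic


# "New topic" partitioned by country instead of user id.
by_country = repartition(source, lambda value: value["country"])

# Each partition can now be aggregated on its own, since a country never
# spans two partitions.
totals = {}
for records in by_country.values():
    for country, value in records:
        totals[country] = totals.get(country, 0) + value["amount"]

print(totals)
```

The cost is the extra topic: every record is written and read a second time
before the aggregation on the new key can run.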

Thanks
Sachin

On Wed, Oct 11, 2017 at 1:01 PM, Mich Talebzadeh <mi...@gmail.com>
wrote:

> Hi,
>
> Has anyone had an experience of using Kafka streams versus Spark?
>
> I am not familiar with Kafka streams concept except that it is a set of
> libraries.
>
> Any feedback will be appreciated.