Posted to users@kafka.apache.org by "Parthasarathy, Mohan" <mp...@hpe.com> on 2021/04/28 17:30:45 UTC

Spark Streams vs Kafka Streams

Hi,

Whenever there is a discussion about which streaming framework to use for near-real-time analytics, it usually becomes a discussion about Spark vs. Kafka Streams. One of the points in favor of Spark Streaming is that simple aggregations are built in; see https://sparkbyexamples.com/spark/spark-sql-aggregate-functions/ for the list. With Kafka Streams, some of them require boilerplate code. Is there any reason why they are not provided as part of the library? I am unable to find any discussion on this topic. Are there any plans to provide such features in the Kafka Streams library?

Thanks
Mohan
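
To make the "boilerplate" point concrete, here is a minimal sketch of what a hand-written average typically involves: a small state object (count plus sum), a step that folds each record into it, and a final step that turns the state into a result. This is plain Python, not the Kafka Streams API; AvgState and aggregate_by_key are hypothetical names used only for illustration, and the sensor records are made-up sample data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvgState:
    """Running state needed to compute an average: a count and a sum."""
    count: int = 0
    total: float = 0.0

    def add(self, value: float) -> "AvgState":
        # Aggregator step: fold one new value into the running state.
        return AvgState(self.count + 1, self.total + value)

    def result(self) -> float:
        # Final mapping step: turn the state into the average.
        return self.total / self.count if self.count else 0.0


def aggregate_by_key(records):
    """Fold (key, value) records into one AvgState per key."""
    states = {}
    for key, value in records:
        # An empty AvgState plays the role of the initializer.
        states[key] = states.get(key, AvgState()).add(value)
    return states


# Made-up sample records, keyed by a hypothetical sensor id.
records = [("sensor-a", 10.0), ("sensor-b", 4.0), ("sensor-a", 20.0)]
averages = {k: s.result() for k, s in aggregate_by_key(records).items()}
print(averages)
```

In a real Kafka Streams application the same three pieces would presumably map onto an initializer, an aggregator and a value mapper, plus a serde for the state type, which is the part a built-in avg() would hide.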

Re: Spark Streams vs Kafka Streams

Posted by "Parthasarathy, Mohan" <mp...@hpe.com>.

Matthias,

Once a Spark DataFrame is created by reading the data from Kafka (https://sparkbyexamples.com/spark/spark-streaming-with-kafka/), you can use Spark SQL, and all the aggregations shown on the page I shared are valid. I feel that having this built into the Kafka Streams library would make things much easier.

Thanks
Mohan


On 4/28/21, 12:00 PM, "Matthias J. Sax" <mj...@apache.org> wrote:

    I am not familiar with all the details of Spark; however, the link
    you shared is for Spark SQL. I thought Spark SQL was for batch processing
    only?
    
    Personally, I would be open to adding more built-in aggregations next to
    count(). It has not come up in the community so far, so there has been no
    investment yet.
    
    
    -Matthias
    
    On 4/28/21 10:30 AM, Parthasarathy, Mohan wrote:
    > Hi,
    > 
    > Whenever there is a discussion about which streaming framework to use for near-real-time analytics, it usually becomes a discussion about Spark vs. Kafka Streams. One of the points in favor of Spark Streaming is that simple aggregations are built in; see https://sparkbyexamples.com/spark/spark-sql-aggregate-functions/ for the list. With Kafka Streams, some of them require boilerplate code. Is there any reason why they are not provided as part of the library? I am unable to find any discussion on this topic. Are there any plans to provide such features in the Kafka Streams library?
    > 
    > Thanks
    > Mohan
    >

Re: Spark Streams vs Kafka Streams

Posted by Mich Talebzadeh <mi...@gmail.com>.

Hi,

"I'd assume this is because Kafka Streams is positioned for
building streaming applications, rather than doing analytics, whereas Spark
is more often used for analytics purposes."

Well, that is not necessarily the full picture. Spark can do both analytics and
streaming, especially with Spark Structured Streaming. Spark Structured
Streaming is the Apache Spark API that lets you express computation on
streaming data *in the same way you express a batch computation on static
data.* That is the strength of Spark. Spark supports Java, Scala and Python
among others. Python, or more specifically PySpark, is particularly popular
for data science as well as conventional analytics.

Structured Streaming Programming Guide - Spark 3.1.1 Documentation
(apache.org)
<https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html>

There are two options with Spark Structured Streaming, called *foreach* and
*foreachBatch*. These operations allow you to apply arbitrary operations and
write logic on the output of a streaming query. They have slightly different
use cases: *while foreach allows custom write logic on every row,
foreachBatch allows arbitrary operations and custom logic on the output of
each micro-batch*.
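
The difference between the two callbacks can be sketched without Spark at all. The snippet below is a plain-Python simulation, not PySpark: run_query, on_row and on_batch are hypothetical names, and the micro-batches are just lists of made-up integers. It only mirrors the calling pattern, where a foreach-style function is invoked once per row and a foreachBatch-style function is invoked once per micro-batch together with a batch id.

```python
def run_query(micro_batches, on_row=None, on_batch=None):
    """Feed micro-batches to a row-level or a batch-level callback."""
    for batch_id, batch in enumerate(micro_batches):
        if on_batch is not None:
            # foreachBatch style: one call per micro-batch, with its id.
            on_batch(batch, batch_id)
        if on_row is not None:
            # foreach style: one call per row.
            for row in batch:
                on_row(row)


# Two made-up micro-batches of rows.
micro_batches = [[1, 2, 3], [4, 5]]

# Row-level sink: custom write logic sees each row on its own.
rows_written = []
run_query(micro_batches, on_row=rows_written.append)

# Batch-level sink: arbitrary logic over the whole micro-batch.
batch_sums = {}


def store_sum(batch, batch_id):
    # Example of batch-wide logic: aggregate over the micro-batch.
    batch_sums[batch_id] = sum(batch)


run_query(micro_batches, on_batch=store_sum)
print(rows_written, batch_sums)
```

The batch-level form is the one that makes it easy to reuse ordinary batch logic, such as an aggregate or a bulk write, on streaming output.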

HTH

On Wed, 28 Apr 2021 at 20:12, Andrew Otto <ot...@wikimedia.org> wrote:

> I'd assume this is because Kafka Streams is positioned for building
> streaming applications, rather than doing analytics, whereas Spark is more
> often used for analytics purposes.
>

Re: Spark Streams vs Kafka Streams

Posted by Liam Clarke-Hutchinson <li...@adscale.co.nz>.

Spark Structured Streaming has some significant limitations compared to
Kafka Streams.

This one has always proved hard to overcome:

"Multiple streaming aggregations (i.e. a chain of aggregations on a
streaming DF) are not yet supported on streaming Datasets."
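
To illustrate what "a chain of aggregations" means here, the sketch below runs two aggregations back to back, where the second consumes the output of the first. It is plain Python over a static, made-up list of events, not Spark code; it only shows the shape of the computation that the quoted limitation refers to.

```python
from collections import Counter

# Made-up (user, action) events.
events = [
    ("alice", "click"),
    ("bob", "click"),
    ("alice", "view"),
    ("alice", "click"),
]

# First aggregation: number of events per user.
events_per_user = Counter(user for user, _ in events)

# Second aggregation, chained on the first: number of users per event count.
users_per_count = Counter(events_per_user.values())

print(dict(events_per_user), dict(users_per_count))
```

On a static data set this is trivial; on a stream, the second step has to react to every update of the first, which is where the chaining becomes hard.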





On Thu, 29 Apr. 2021, 8:13 am Parthasarathy, Mohan, <mp...@hpe.com>
wrote:

> Matthias,
>
> I will create a KIP or ticket for tracking this issue.
>
> -thanks
> Mohan
>
>
> On 4/28/21, 1:01 PM, "Matthias J. Sax" <mj...@apache.org> wrote:
>
>     Feel free to do a KIP and contribute to Kafka!
>
>     https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
>
>     Or create a ticket for tracking.
>
>
>     -Matthias
>
>     On 4/28/21 12:49 PM, Parthasarathy, Mohan wrote:
>     > Andrew,
>     >
>     > I am not sure I understand. We have built several analytics
>     > applications. We typically use custom aggregations as they are not
>     > available directly in the library.
>     >
>     > -mohan
>     >
>     >
>     > On 4/28/21, 12:12 PM, "Andrew Otto" <ot...@wikimedia.org> wrote:
>     >
>     >     I'd assume this is because Kafka Streams is positioned for building
>     >     streaming applications, rather than doing analytics, whereas Spark is more
>     >     often used for analytics purposes.
>     >
>     >
>
>
>

Re: Spark Streams vs Kafka Streams

Posted by "Parthasarathy, Mohan" <mp...@hpe.com>.

Matthias,

I will create a KIP or ticket for tracking this issue.

-thanks
Mohan


On 4/28/21, 1:01 PM, "Matthias J. Sax" <mj...@apache.org> wrote:

    Feel free to do a KIP and contribute to Kafka!
    
    https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals 
    
    Or create a ticket for tracking.
    
    
    -Matthias
    
    On 4/28/21 12:49 PM, Parthasarathy, Mohan wrote:
    > Andrew,
    > 
    > I am not sure I understand. We have built several analytics applications. We typically use custom aggregations as they are not available directly in the library. 
    > 
    > -mohan
    > 
    > 
    > On 4/28/21, 12:12 PM, "Andrew Otto" <ot...@wikimedia.org> wrote:
    > 
    >     I'd assume this is because Kafka Streams is positioned for building
    >     streaming applications, rather than doing analytics, whereas Spark is more
    >     often used for analytics purposes.
    >     
    >

Re: Spark Streams vs Kafka Streams

Posted by "Matthias J. Sax" <mj...@apache.org>.

Feel free to do a KIP and contribute to Kafka!

https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals

Or create a ticket for tracking.


-Matthias

On 4/28/21 12:49 PM, Parthasarathy, Mohan wrote:
> Andrew,
> 
> I am not sure I understand. We have built several analytics applications. We typically use custom aggregations as they are not available directly in the library. 
> 
> -mohan
> 
> 
> On 4/28/21, 12:12 PM, "Andrew Otto" <ot...@wikimedia.org> wrote:
> 
>     I'd assume this is because Kafka Streams is positioned for building
>     streaming applications, rather than doing analytics, whereas Spark is more
>     often used for analytics purposes.
>     
>

Re: Spark Streams vs Kafka Streams

Posted by Andrew Otto <ot...@wikimedia.org>.

> I am not sure I understand. We have built several analytics applications.
> We typically use custom aggregations as they are not available directly in
> the library.

Oh for sure! I was answering this question:
> Is there any reason why it is not provided as part of the library ?

And assuming that the reason was mainly that the developers building Kafka
Streams aren't typically targeting analytics use cases in the same way that
Spark is.  Not that there is any reason those aggregations should not be in
Kafka Streams, I'm sure that would be great! :)

On Wed, Apr 28, 2021 at 3:50 PM Parthasarathy, Mohan <mp...@hpe.com>
wrote:

> Andrew,
>
> I am not sure I understand. We have built several analytics applications.
> We typically use custom aggregations as they are not available directly in
> the library.
>
> -mohan
>
>
> On 4/28/21, 12:12 PM, "Andrew Otto" <ot...@wikimedia.org> wrote:
>
>     I'd assume this is because Kafka Streams is positioned for building
>     streaming applications, rather than doing analytics, whereas Spark is more
>     often used for analytics purposes.
>
>
>

Re: Spark Streams vs Kafka Streams

Posted by "Parthasarathy, Mohan" <mp...@hpe.com>.

Andrew,

I am not sure I understand. We have built several analytics applications. We typically use custom aggregations as they are not available directly in the library. 

-mohan


On 4/28/21, 12:12 PM, "Andrew Otto" <ot...@wikimedia.org> wrote:

    I'd assume this is because Kafka Streams is positioned for building
    streaming applications, rather than doing analytics, whereas Spark is more
    often used for analytics purposes.

Re: Spark Streams vs Kafka Streams

Posted by Andrew Otto <ot...@wikimedia.org>.

I'd assume this is because Kafka Streams is positioned for building
streaming applications, rather than doing analytics, whereas Spark is more
often used for analytics purposes.

Re: Spark Streams vs Kafka Streams

Posted by "Matthias J. Sax" <mj...@apache.org>.

I am not familiar with all the details of Spark; however, the link
you shared is for Spark SQL. I thought Spark SQL was for batch processing
only?

Personally, I would be open to adding more built-in aggregations next to
count(). It has not come up in the community so far, so there has been no
investment yet.

-Matthias

On 4/28/21 10:30 AM, Parthasarathy, Mohan wrote:
> Hi,
> 
> Whenever there is a discussion about which streaming framework to use for near-real-time analytics, it usually becomes a discussion about Spark vs. Kafka Streams. One of the points in favor of Spark Streaming is that simple aggregations are built in; see https://sparkbyexamples.com/spark/spark-sql-aggregate-functions/ for the list. With Kafka Streams, some of them require boilerplate code. Is there any reason why they are not provided as part of the library? I am unable to find any discussion on this topic. Are there any plans to provide such features in the Kafka Streams library?
> 
> Thanks
> Mohan
>