Posted to user@spark.apache.org by kant kodali <ka...@gmail.com> on 2017/06/11 08:12:03 UTC

What is the real difference between Kafka streaming and Spark Streaming?

Hi All,

I am trying hard to figure out the real difference between Kafka
Streams and Spark Streaming, other than that one can be used as part of
microservices (since Kafka Streams is just a library) and the other is a
standalone framework.

If I can accomplish the same job either way, this is a puzzling
question for me, so it would be great to know what Spark Streaming can do
that Kafka Streams cannot do efficiently, or cannot do at all.

Thanks!

Re: What is the real difference between Kafka streaming and Spark Streaming?

Posted by Michael Armbrust <mi...@databricks.com>.

Continuous processing is still a work in progress.  I would really like to
at least have a basic version in Spark 2.3.

The announcement about 2.2 is that we are planning to remove the
experimental tag from Structured Streaming.

Re: What is the real difference between Kafka streaming and Spark Streaming?

Posted by kant kodali <ka...@gmail.com>.

Wow! You caught the 007! Is continuous processing mode available in 2.2?
The ticket says the target version is 2.3, but the talk in the video says
2.2 and beyond, so I am just curious whether it is available in 2.2 or
whether I should try it from the latest build.

Thanks!

Re: What is the real difference between Kafka streaming and Spark Streaming?

Posted by Michael Armbrust <mi...@databricks.com>.

This is a good question. I really like using Kafka as a centralized source for
streaming data in an organization and, with Spark 2.2, we have full support
for reading and writing data to/from Kafka in both streaming and batch
<https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html>.
I'll focus here on what I think the advantages are of Structured Streaming
over Kafka Streams (a stream processing library that reads from Kafka).

 - *High level productive APIs* - Streaming queries in Spark can be
expressed using DataFrames, Datasets or even plain SQL.  Streaming
DataFrames/SQL are supported in Scala, Java, Python and even R.  This means
that for common operations like filtering, joining, aggregating, you can
use built-in operations.  For complicated custom logic you can use UDFs and
lambda functions. In contrast, Kafka Streams mostly requires you to express
your transformations using lambda functions.
 - *High Performance* - Since it is built on Spark SQL, streaming queries
take advantage of the Catalyst optimizer and the Tungsten execution engine.
This design leads to huge performance wins
<https://databricks.com/blog/2017/06/06/simple-super-fast-streaming-engine-apache-spark.html>,
which means you need less hardware to accomplish the same job.
 - *Ecosystem* - Spark has connectors for working with all kinds of data
stored in a variety of systems.  This means you can join a stream with data
encoded in parquet and stored in S3/HDFS.  Perhaps more importantly, it
also means that if you decide that you don't want to manage a Kafka cluster
anymore and would rather use Kinesis, you can do that too.  We recently
moved a bunch of our pipelines from Kafka to Kinesis and only had to change
a few lines of code! I think it's likely that in the future Spark will also
have connectors for Google's PubSub and Azure's streaming offerings.
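
To make the Ecosystem point concrete, here is a toy sketch in plain Python
(it is not Spark code). The query logic is written once against a generic
stream of records, so switching the source only changes the line that builds
the input. The names kafka_like_source, kinesis_like_source and
count_errors_by_service are hypothetical stand-ins for a real connector and a
real streaming query, and the records are made up.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator

Record = Dict[str, str]

# Hypothetical stand-ins for two connectors; a real job would read from
# Kafka or Kinesis through the engine's source API instead.
def kafka_like_source() -> Iterator[Record]:
    yield {"service": "api", "level": "ERROR"}
    yield {"service": "api", "level": "INFO"}
    yield {"service": "db", "level": "ERROR"}

def kinesis_like_source() -> Iterator[Record]:
    # Same logical records, delivered by a different system in another order.
    yield {"service": "db", "level": "ERROR"}
    yield {"service": "api", "level": "ERROR"}
    yield {"service": "api", "level": "INFO"}

def count_errors_by_service(records: Iterable[Record]) -> Dict[str, int]:
    """Filter and aggregate; knows nothing about where records come from."""
    errors = (r for r in records if r["level"] == "ERROR")
    return dict(Counter(r["service"] for r in errors))

# Swapping the source is a one-line change; the query itself is untouched.
from_kafka = count_errors_by_service(kafka_like_source())
from_kinesis = count_errors_by_service(kinesis_like_source())
print(from_kafka, from_kinesis)
```

Both calls produce the same per-service counts, which is the property that
makes a source migration cheap when the query is decoupled from the connector.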

Regarding latency, there has been a lot of discussion about the inherent
latencies of micro-batch.  Fortunately, we were very careful to leave
batching out of the user-facing API, and as we demoed last week, this
makes it possible for Spark Streaming to achieve sub-millisecond
latencies <https://www.youtube.com/watch?v=qAZ5XUz32yM>.  Watch SPARK-20928
<https://issues.apache.org/jira/browse/SPARK-20928> for more on this effort
to eliminate micro-batch from Spark's execution model.
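
The latency discussion can be illustrated with a deliberately simplified
model in plain Python (an assumption-laden sketch, not a benchmark). It
assumes triggers fire at exact multiples of a fixed interval and ignores
processing and scheduling time for the micro-batch case. The arrival times
and the per-record cost are made-up numbers.

```python
import math
from typing import List

def micro_batch_latencies(arrivals: List[float], interval: float) -> List[float]:
    """Each record waits until the next trigger at a multiple of `interval`."""
    latencies = []
    for t in arrivals:
        next_trigger = math.ceil(t / interval) * interval
        latencies.append(next_trigger - t)
    return latencies

def per_record_latencies(arrivals: List[float], cost: float) -> List[float]:
    """Each record is handled as it arrives, paying only a processing cost."""
    return [cost for _ in arrivals]

arrivals = [0.25, 0.5, 1.75, 2.0]  # arrival times in seconds (made up)
batch = micro_batch_latencies(arrivals, interval=1.0)
single = per_record_latencies(arrivals, cost=0.001)
print(batch)        # waiting time until the next 1-second trigger
print(max(single))  # worst case when records are handled one at a time
```

In this model the micro-batch latency is bounded by the trigger interval,
while the per-record latency is bounded only by the processing cost. Because
the batching is not visible in the query itself, the same query can in
principle run under either execution model.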

At the far other end of the latency spectrum...  For those with jobs that
run in the cloud on data that arrives sporadically, you can run streaming
jobs that only execute every few hours or every few days, shutting the
cluster down in between.  This architecture can result in huge cost
savings for some applications
<https://databricks.com/blog/2017/05/22/running-streaming-jobs-day-10x-cost-savings.html>.
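
A rough sketch of the run-occasionally pattern, in plain Python rather than
Spark: each run processes whatever arrived since the last recorded position
and then stops. The run_once function, the in-memory log and the dictionary
used as a checkpoint are all hypothetical; in practice the progress marker
would have to live in durable storage so it survives the cluster being shut
down.

```python
from typing import Dict, List

def run_once(log: List[str], checkpoint: Dict[str, int]) -> List[str]:
    """Process everything past the saved offset, then stop."""
    start = checkpoint.get("offset", 0)
    new_records = log[start:]
    processed = [r.upper() for r in new_records]  # placeholder transformation
    checkpoint["offset"] = len(log)  # remember progress for the next run
    return processed

log = ["a", "b"]                 # data that has arrived so far
checkpoint: Dict[str, int] = {}  # would be durable storage in practice

first = run_once(log, checkpoint)   # cluster starts, processes, shuts down
log += ["c"]                        # more data arrives while nothing runs
second = run_once(log, checkpoint)  # next run picks up where the last ended
third = run_once(log, checkpoint)   # nothing new, so no work is done
print(first, second, third)
```

The point of the sketch is that no record is processed twice and none is
skipped, even though nothing is running between invocations.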

Michael

Re: What is the real difference between Kafka streaming and Spark Streaming?

Posted by yohann jardin <yo...@hotmail.com>.

Hey,

Kafka can also do streaming on its own: https://kafka.apache.org/documentation/streams
Unfortunately, I don't know much about it. I can only repeat what I heard at conferences: one should give Kafka Streams a try when one's whole pipeline is using Kafka. I have no pros or cons to argue on this topic.

Yohann Jardin

On 6/11/2017 at 7:08 PM, vaquar khan wrote:

Hi Kant,

Kafka is the message broker, used through producers and consumers, and Spark Streaming is used for the real-time processing. Kafka and Spark Streaming work together; they are not competitors.

Spark Streaming reads data from Kafka and processes it as micro-batches. In simple terms, it collects data for some time, builds an RDD, and then processes these micro-batches.
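
As a toy illustration of the collect-then-process idea (plain Python, not the
Spark Streaming API), the sketch below buckets timestamped events by the
interval in which they arrived and then handles each bucket as one unit. The
function name to_micro_batches, the event data and the one-second interval
are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def to_micro_batches(events: List[Tuple[float, str]], interval: float) -> Dict[int, List[str]]:
    """Bucket (timestamp, value) events by the interval they arrived in."""
    batches: Dict[int, List[str]] = defaultdict(list)
    for ts, value in events:
        batches[int(ts // interval)].append(value)
    return dict(batches)

events = [(0.1, "a"), (0.7, "b"), (1.2, "c"), (2.9, "d")]  # made-up data
batches = to_micro_batches(events, interval=1.0)

# Each batch is then processed as one unit, here by counting its records.
counts = {batch_id: len(values) for batch_id, values in batches.items()}
print(batches)
print(counts)
```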


Please read the docs: https://spark.apache.org/docs/latest/streaming-programming-guide.html


Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. In fact, you can apply Spark’s machine learning <https://spark.apache.org/docs/latest/ml-guide.html> and graph processing <https://spark.apache.org/docs/latest/graphx-programming-guide.html> algorithms on data streams.


Regards,

Vaquar khan

--
Regards,
Vaquar Khan
+1 -224-436-0783
Greater Chicago

Re: What is the real difference between Kafka streaming and Spark Streaming?

Posted by Paolo Patierno <pp...@live.com>.

I think that a big advantage of not using Spark Streaming when your solution is already based on Kafka is that you don't have to deal with another cluster.
Imagine that your solution already uses Kafka as the ingestion system for your events, and then you need to do some real-time analysis on the streams. Adding Spark means adding a new cluster with a master and one or more nodes; Spark then distributes the jobs for you. Using the lightweight Streams library from Kafka means just developing a new application that gets events from the same cluster. You can deploy more instances of the same application for load balancing, and all of that is handled by Kafka itself.
I think that, in terms of deployment, this is a big advantage of using Kafka Streams with the existing Kafka cluster instead of adding Spark.

Paolo
________________________________
From: kant kodali <ka...@gmail.com>
Sent: Monday, June 12, 2017 12:40:37 AM
To: Mohammed Guller
Cc: vincent gromakowski; yohann jardin; vaquar khan; user
Subject: Re: What is the real difference between Kafka streaming and Spark Streaming?

Another difference I see is something like Spark SQL, where there are logical plans, physical plans, code generation and all those optimizations; I don't see them in Kafka Streams at this time.

On Sun, Jun 11, 2017 at 2:19 PM, kant kodali <ka...@gmail.com> wrote:
I appreciate the responses; however, I see the other side of the argument, and I actually feel they are now competitors in the streaming space in some sense.

Kafka Streams can indeed do map, reduce, join and window operations. Likewise, data can be ingested into Kafka from many sources and the results sent out to many sinks. Look up "Kafka Connect".

Regarding event-at-a-time vs. micro-batch: I hear one group of people saying Spark Streaming is real time and another group saying Kafka Streams is the true real time. So do we say micro-batch is real time, or event-at-a-time is real time?

It is a well-known fact that Spark is more popular with data scientists who want to run ML algorithms and so on, but I also hear that people can use the H2O package along with Kafka Streams. How efficient each of these approaches is, I have no clue.

The major difference I see is actually the Spark scheduler. I don't think Kafka Streams has anything like this; instead, it just allows you to run lambda expressions on a stream and write the result out to a specific topic/partition, and from there one can use Kafka Connect to write it out to any sink. In short, all the optimizations built into the Spark scheduler don't seem to exist in Kafka Streams, so if I were to decide which framework to use, this is an additional question I would think about: "Do I want my stream to go through the scheduler and, if so, why or why not?"

Above all, please correct me if I am wrong :)

On Sun, Jun 11, 2017 at 12:41 PM, Mohammed Guller <mo...@glassbeam.com> wrote:
Just to elaborate on what Vincent wrote: Kafka Streams provides true record-at-a-time processing capabilities, whereas Spark Streaming provides micro-batching capabilities on top of Spark. Depending on your use case, you may find one better than the other. Both provide stateless and stateful stream processing capabilities.

A few more things to consider:

  1.  If you don’t already have a Spark cluster, but have a Kafka cluster, it may be easier to use Kafka Streams since you don’t need to set up and manage another cluster.
  2.  On the other hand, if you already have a Spark cluster, but don’t have a Kafka cluster (in case you are using some other messaging system), Spark Streaming is a better option.
  3.  If you already know and use Spark, you may find it easier to program with the Spark Streaming API even if you are using Kafka.
  4.  Spark Streaming may give you better throughput. So you have to decide what is more important for your stream processing application: latency or throughput?
  5.  Kafka Streams is relatively new and less mature than Spark Streaming.

Mohammed

From: vincent gromakowski [mailto:vincent.gromakowski@gmail.com]
Sent: Sunday, June 11, 2017 12:09 PM
To: yohann jardin <yo...@hotmail.com>
Cc: kant kodali <ka...@gmail.com>; vaquar khan <va...@gmail.com>; user <us...@spark.apache.org>
Subject: Re: What is the real difference between Kafka streaming and Spark Streaming?

I think Kafka Streams is good when the processing of each row is independent of the others (row parsing, data cleaning...).
Spark is better when processing groups of rows (group by, ML, window functions...).

RE: What is the real difference between Kafka streaming and Spark Streaming?

Posted by Mohammed Guller <mo...@glassbeam.com>.

Regarding the Spark scheduler: if you are referring to the ability to distribute workload and scale, Kafka Streams also provides that capability. It is deceptively simple in that regard if you already have a Kafka cluster. You can launch multiple instances of your Kafka Streams application, and Kafka Streams will automatically balance the workload across the different instances. It rebalances the workload as you add or remove instances. Similarly, if an instance fails or crashes, it will automatically detect that.
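
The rebalancing behaviour described above can be pictured with a small plain
Python sketch. It spreads a fixed set of partitions over whichever instances
are currently alive and recomputes the assignment when the set of instances
changes. The round-robin policy, the assign function and the instance names
are chosen purely for illustration; this is not the assignment strategy Kafka
actually uses.

```python
from typing import Dict, List

def assign(partitions: List[int], instances: List[str]) -> Dict[str, List[int]]:
    """Spread partitions over the live instances round-robin."""
    assignment: Dict[str, List[int]] = {name: [] for name in instances}
    for i, p in enumerate(partitions):
        assignment[instances[i % len(instances)]].append(p)
    return assignment

partitions = [0, 1, 2, 3, 4, 5]

one = assign(partitions, ["app-1"])
three = assign(partitions, ["app-1", "app-2", "app-3"])  # two instances added
after_failure = assign(partitions, ["app-1", "app-3"])   # app-2 crashed
print(one)
print(three)
print(after_failure)
```

Every partition always has exactly one owner, so adding instances spreads the
load and losing one hands its partitions to the survivors.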

Regarding real time: rather than debating which one is real time, I would look at the latency requirements of my application. For most applications, the near-real-time capabilities of Spark Streaming might be good enough. For others, they may not be. For example, if I were building a high-frequency trading application, where I want to process individual trades as soon as they happen, I might lean towards Kafka Streams.

I agree about the benefits of using SQL with Structured Streaming.

Mohammed

Re: What is the real difference between Kafka streaming and Spark Streaming?

Posted by kant kodali <ka...@gmail.com>.

Another difference I see is something like Spark SQL, where there are
logical plans, physical plans, code generation and all those optimizations;
I don't see them in Kafka Streams at this time.

On Sun, Jun 11, 2017 at 2:19 PM, kant kodali <ka...@gmail.com> wrote:

> I appreciate the responses however I see the other side of the argument
> and I actually feel they are competitors now in Streaming space in some
> sense.
>
> Kafka Streaming can indeed do map, reduce, join and window operations and
> Like wise data can be ingested from many sources in Kafka and send the
> results out to many sinks. Look up "Kafka Connect"
>
> Regarding Event at a time vs Micro-batch. I hear arguments from a group of
> people saying Spark Streaming is real time and other group of people is
> Kafka streaming is the true real time. so do we say Micro-batch is real
> time or Event at a time is real time?
>
> It is well known fact that Spark is more popular with Data scientists who
> want to run ML Algorithms and so on but I also hear that people can use H2O
> package along with Kafka Streaming. so efficient each of these approaches
> are is something I have no clue.
>
> The major difference I see is actually the *Spark Scheduler* I don't
> think Kafka Streaming has anything like this instead it just allows you to
> run lambda expressions on a stream and write it out to specific
> topic/partition and from there one can use Kafka Connect to write it out to
> any sink. so In short, All the optimizations built into spark scheduler
> don't seem to exist in Kafka Streaming so if I were to make a decision on
> which framework to use this is an additional question I would think about
> like "Do I want my stream to go through the scheduler and if so, why or why
> not"
>
> Above all, please correct me if I am wrong :)
>
>
>
>
> On Sun, Jun 11, 2017 at 12:41 PM, Mohammed Guller <mo...@glassbeam.com>
> wrote:
>
>> Just to elaborate more on Vincent wrote – Kafka streaming provides true
>> record-at-a-time processing capabilities whereas Spark Streaming provides
>> micro-batching capabilities on top of Spark. Depending on your use case,
>> you may find one better than the other. Both provide stateless ad stateful
>> stream processing capabilities.
>>
>>
>>
>> A few more things to consider:
>>
>>    1. If you don’t already have a Spark cluster, but have Kafka cluster,
>>    it may be easier to use Kafka streaming since you don’t need to setup and
>>    manage another cluster.
>>    2. On the other hand, if you already have a spark cluster, but don’t
>>    have a Kafka cluster (in case you are using some other messaging system),
>>    Spark streaming is a better option.
>>    3. If you already know and use Spark, you may find it easier to
>>    program with Spark Streaming API even if you are using Kafka.
>>    4. Spark Streaming may give you better throughput. So you have to
>>    decide what is more important for your stream processing application –
>>    latency or throughput?
>>    5. Kafka streaming is relatively new and less mature than Spark
>>    Streaming
>>
>>
>>
>> Mohammed
>>
>>
>>
>> *From:* vincent gromakowski [mailto:vincent.gromakowski@gmail.com]
>> *Sent:* Sunday, June 11, 2017 12:09 PM
>> *To:* yohann jardin <yo...@hotmail.com>
>> *Cc:* kant kodali <ka...@gmail.com>; vaquar khan <
>> vaquar.khan@gmail.com>; user <us...@spark.apache.org>
>> *Subject:* Re: What is the real difference between Kafka streaming and
>> Spark Streaming?
>>
>>
>>
>> I think Kafka Streams is good when the processing of each row is
>> independent of the others (row parsing, data cleaning...).
>>
>> Spark is better when processing groups of rows (group by, ML, window
>> functions...).
>>
>>
>>
>> On June 11, 2017 8:15 PM, "yohann jardin" <yo...@hotmail.com> wrote:
>>
>> Hey,
>>
>> Kafka can also do streaming on its own:
>> https://kafka.apache.org/documentation/streams
>> I don’t know much about it, unfortunately. I can only repeat what I heard
>> at conferences: one should give Kafka streaming a try when the whole
>> pipeline uses Kafka. I have no pros or cons to argue on this topic.
>>
>> *Yohann Jardin*
>>
>> On 6/11/2017 at 7:08 PM, vaquar khan wrote:
>>
>> Hi Kant,
>>
>> Kafka is a message broker, used through producers and consumers, and
>> Spark Streaming is used for real-time processing. Kafka and Spark
>> Streaming work together; they are not competitors.
>>
>> Spark Streaming reads data from Kafka and processes it in micro-batches.
>> In simple terms, it collects data for some time, builds an RDD, and
>> then processes these micro-batches.
>>
>> Please read the docs:
>> https://spark.apache.org/docs/latest/streaming-programming-guide.html
>>
>>
>>
>> Spark Streaming is an extension of the core Spark API that enables
>> scalable, high-throughput, fault-tolerant stream processing of live data
>> streams. Data can be ingested from many sources like *Kafka, Flume,
>> Kinesis, or TCP sockets*, and can be processed using complex algorithms
>> expressed with high-level functions like map, reduce, join and window.
>> Finally, processed data can be pushed out to filesystems, databases, and
>> live dashboards. In fact, you can apply Spark’s machine learning
>> <https://spark.apache.org/docs/latest/ml-guide.html> and graph processing
>> <https://spark.apache.org/docs/latest/graphx-programming-guide.html> algorithms
>> on data streams.
>>
>>
>>
>> Regards,
>>
>> Vaquar khan
>>
>>
>>
>> On Sun, Jun 11, 2017 at 3:12 AM, kant kodali <ka...@gmail.com> wrote:
>>
>> Hi All,
>>
>>
>>
>> I am trying hard to figure out what is the real difference between Kafka
>> Streaming vs Spark Streaming other than saying one can be used as part of
>> Micro services (since Kafka streaming is just a library) and the other is a
>> Standalone framework by itself.
>>
>>
>>
>> If I can accomplish same job one way or other this is a sort of a
>> puzzling question for me so it would be great to know what Spark streaming
>> can do that Kafka Streaming cannot do efficiently or whatever ?
>>
>>
>>
>> Thanks!
>>
>>
>>
>>
>>
>>
>>
>> --
>>
>> Regards,
>>
>> Vaquar Khan
>>
>> +1 -224-436-0783
>>
>> Greater Chicago
>>
>>
>>
>>
>>
>>
>

Re: What is the real difference between Kafka streaming and Spark Streaming?

Posted by kant kodali <ka...@gmail.com>.

I appreciate the responses; however, I see the other side of the argument,
and I actually feel they are now competitors in the streaming space in some
sense.

Kafka Streaming can indeed do map, reduce, join and window operations.
Likewise, data can be ingested into Kafka from many sources and the results
sent out to many sinks; look up "Kafka Connect".

Regarding event-at-a-time vs. micro-batch: I hear one group of people saying
Spark Streaming is real time and another group saying Kafka Streaming is the
true real time. So do we say micro-batch is real time, or event-at-a-time is
real time?

It is a well-known fact that Spark is more popular with data scientists who
want to run ML algorithms and so on, but I also hear that people can use the
H2O package along with Kafka Streaming. How efficient each of these
approaches is, I have no clue.

The major difference I see is actually the *Spark Scheduler*. I don't think
Kafka Streaming has anything like it; instead it just allows you to run
lambda expressions on a stream and write the results out to a specific
topic/partition, and from there one can use Kafka Connect to write them out
to any sink. In short, the optimizations built into the Spark scheduler don't
seem to exist in Kafka Streaming, so if I were to decide which framework to
use, this is an additional question I would think about: "Do I want my
stream to go through the scheduler, and if so, why or why not?"
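
To make the "lambda on a stream, written out to a topic/partition" model concrete, here is a minimal plain-Python sketch. It does not use the Kafka Streams API: `process_stream`, the sample records, and the CRC32-based partition choice are all assumptions made for illustration (Kafka's own partitioner is different). The point is only that each record is transformed on its own and routed by key, with no scheduler in between.

```python
import zlib
from typing import Callable, Dict, Iterable, List, Tuple

Record = Tuple[str, str]  # (key, value)


def process_stream(
    source: Iterable[Record],
    transform: Callable[[Record], Record],
    num_partitions: int,
) -> Dict[int, List[Record]]:
    """Apply `transform` to each record as it arrives and route it by key.

    Hypothetical helper: a dict of lists stands in for the partitions of an
    output topic.
    """
    output: Dict[int, List[Record]] = {p: [] for p in range(num_partitions)}
    for record in source:
        key, value = transform(record)  # one record at a time, no batching
        # Stand-in partitioner (an assumption, not Kafka's algorithm):
        # a stable hash of the key, modulo the partition count.
        partition = zlib.crc32(key.encode("utf-8")) % num_partitions
        output[partition].append((key, value))
    return output


clicks = [("alice", "click"), ("bob", "view"), ("alice", "view")]
result = process_stream(
    clicks, lambda kv: (kv[0], kv[1].upper()), num_partitions=3
)
```

Records with the same key always land in the same partition and keep their relative order, which is what lets a downstream consumer process one key's events sequentially.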

Above all, please correct me if I am wrong :)




On Sun, Jun 11, 2017 at 12:41 PM, Mohammed Guller <mo...@glassbeam.com>
wrote:

> Just to elaborate on what Vincent wrote: Kafka streaming provides true
> record-at-a-time processing capabilities whereas Spark Streaming provides
> micro-batching capabilities on top of Spark. Depending on your use case,
> you may find one better than the other. Both provide stateless and stateful
> stream processing capabilities.
>
>
>
> A few more things to consider:
>
>    1. If you don’t already have a Spark cluster, but have Kafka cluster,
>    it may be easier to use Kafka streaming since you don’t need to setup and
>    manage another cluster.
>    2. On the other hand, if you already have a spark cluster, but don’t
>    have a Kafka cluster (in case you are using some other messaging system),
>    Spark streaming is a better option.
>    3. If you already know and use Spark, you may find it easier to
>    program with Spark Streaming API even if you are using Kafka.
>    4. Spark Streaming may give you better throughput. So you have to
>    decide what is more important for your stream processing application –
>    latency or throughput?
>    5. Kafka streaming is relatively new and less mature than Spark
>    Streaming
>
>
>
> Mohammed
>
>
>
> *From:* vincent gromakowski [mailto:vincent.gromakowski@gmail.com]
> *Sent:* Sunday, June 11, 2017 12:09 PM
> *To:* yohann jardin <yo...@hotmail.com>
> *Cc:* kant kodali <ka...@gmail.com>; vaquar khan <va...@gmail.com>;
> user <us...@spark.apache.org>
> *Subject:* Re: What is the real difference between Kafka streaming and
> Spark Streaming?
>
>
>
> I think Kafka Streams is good when the processing of each row is
> independent of the others (row parsing, data cleaning...).
>
> Spark is better when processing groups of rows (group by, ML, window
> functions...).
>
>
>
> On June 11, 2017 8:15 PM, "yohann jardin" <yo...@hotmail.com> wrote:
>
> Hey,
>
> Kafka can also do streaming on its own:
> https://kafka.apache.org/documentation/streams
> I don’t know much about it, unfortunately. I can only repeat what I heard
> at conferences: one should give Kafka streaming a try when the whole
> pipeline uses Kafka. I have no pros or cons to argue on this topic.
>
> *Yohann Jardin*
>
> On 6/11/2017 at 7:08 PM, vaquar khan wrote:
>
> Hi Kant,
>
> Kafka is a message broker, used through producers and consumers, and
> Spark Streaming is used for real-time processing. Kafka and Spark
> Streaming work together; they are not competitors.
>
> Spark Streaming reads data from Kafka and processes it in micro-batches.
> In simple terms, it collects data for some time, builds an RDD, and
> then processes these micro-batches.
>
> Please read the docs:
> https://spark.apache.org/docs/latest/streaming-programming-guide.html
>
>
>
> Spark Streaming is an extension of the core Spark API that enables
> scalable, high-throughput, fault-tolerant stream processing of live data
> streams. Data can be ingested from many sources like *Kafka, Flume,
> Kinesis, or TCP sockets*, and can be processed using complex algorithms
> expressed with high-level functions like map, reduce, join and window.
> Finally, processed data can be pushed out to filesystems, databases, and
> live dashboards. In fact, you can apply Spark’s machine learning
> <https://spark.apache.org/docs/latest/ml-guide.html> and graph processing
> <https://spark.apache.org/docs/latest/graphx-programming-guide.html> algorithms
> on data streams.
>
>
>
> Regards,
>
> Vaquar khan
>
>
>
> On Sun, Jun 11, 2017 at 3:12 AM, kant kodali <ka...@gmail.com> wrote:
>
> Hi All,
>
>
>
> I am trying hard to figure out what is the real difference between Kafka
> Streaming vs Spark Streaming other than saying one can be used as part of
> Micro services (since Kafka streaming is just a library) and the other is a
> Standalone framework by itself.
>
>
>
> If I can accomplish same job one way or other this is a sort of a puzzling
> question for me so it would be great to know what Spark streaming can do
> that Kafka Streaming cannot do efficiently or whatever ?
>
>
>
> Thanks!
>
>
>
>
>
>
>
> --
>
> Regards,
>
> Vaquar Khan
>
> +1 -224-436-0783
>
> Greater Chicago
>
>
>
>
>
>

RE: What is the real difference between Kafka streaming and Spark Streaming?

Posted by Mohammed Guller <mo...@glassbeam.com>.

Just to elaborate on what Vincent wrote: Kafka streaming provides true record-at-a-time processing capabilities whereas Spark Streaming provides micro-batching capabilities on top of Spark. Depending on your use case, you may find one better than the other. Both provide stateless and stateful stream processing capabilities.

A few more things to consider:

  1.  If you don’t already have a Spark cluster, but have a Kafka cluster, it may be easier to use Kafka streaming since you don’t need to set up and manage another cluster.
  2.  On the other hand, if you already have a Spark cluster, but don’t have a Kafka cluster (in case you are using some other messaging system), Spark Streaming is a better option.
  3.  If you already know and use Spark, you may find it easier to program with the Spark Streaming API even if you are using Kafka.
  4.  Spark Streaming may give you better throughput. So you have to decide what is more important for your stream processing application: latency or throughput?
  5.  Kafka streaming is relatively new and less mature than Spark Streaming.
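
The latency-versus-throughput trade-off in point 4 can be illustrated with a toy cost model. This is only a sketch under stated assumptions, not a measurement of either system: it assumes every processing invocation pays a fixed overhead plus a per-record cost, and the two constants below are invented for illustration.

```python
def batch_latency_s(batch_size: int, overhead_s: float, per_record_s: float) -> float:
    """Time to finish one batch: fixed overhead plus per-record work."""
    return overhead_s + batch_size * per_record_s


def throughput(batch_size: int, overhead_s: float, per_record_s: float) -> float:
    """Records per second when the fixed overhead is paid once per batch."""
    return batch_size / batch_latency_s(batch_size, overhead_s, per_record_s)


# Invented numbers, for illustration only.
OVERHEAD_S = 0.010    # assumed fixed cost per invocation
PER_RECORD_S = 0.001  # assumed cost per record

for size in (1, 10, 100):
    print(
        size,
        round(batch_latency_s(size, OVERHEAD_S, PER_RECORD_S), 3),
        round(throughput(size, OVERHEAD_S, PER_RECORD_S), 1),
    )
```

Under this model, larger batches amortize the fixed overhead and raise throughput, but every record in the batch waits longer for its result. A batch size of 1 corresponds to record-at-a-time processing.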

Mohammed

From: vincent gromakowski [mailto:vincent.gromakowski@gmail.com]
Sent: Sunday, June 11, 2017 12:09 PM
To: yohann jardin <yo...@hotmail.com>
Cc: kant kodali <ka...@gmail.com>; vaquar khan <va...@gmail.com>; user <us...@spark.apache.org>
Subject: Re: What is the real difference between Kafka streaming and Spark Streaming?

I think Kafka Streams is good when the processing of each row is independent of the others (row parsing, data cleaning...).
Spark is better when processing groups of rows (group by, ML, window functions...).

On June 11, 2017 8:15 PM, "yohann jardin" <yo...@hotmail.com> wrote:

Hey,
Kafka can also do streaming on its own: https://kafka.apache.org/documentation/streams
I don’t know much about it, unfortunately. I can only repeat what I heard at conferences: one should give Kafka streaming a try when the whole pipeline uses Kafka. I have no pros or cons to argue on this topic.

Yohann Jardin
On 6/11/2017 at 7:08 PM, vaquar khan wrote:

Hi Kant,

Kafka is a message broker, used through producers and consumers, and Spark Streaming is used for real-time processing. Kafka and Spark Streaming work together; they are not competitors.
Spark Streaming reads data from Kafka and processes it in micro-batches. In simple terms, it collects data for some time, builds an RDD, and then processes these micro-batches.

Please read the docs: https://spark.apache.org/docs/latest/streaming-programming-guide.html


Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. In fact, you can apply Spark’s machine learning <https://spark.apache.org/docs/latest/ml-guide.html> and graph processing <https://spark.apache.org/docs/latest/graphx-programming-guide.html> algorithms on data streams.


Regards,

Vaquar khan

On Sun, Jun 11, 2017 at 3:12 AM, kant kodali <ka...@gmail.com> wrote:
Hi All,

I am trying hard to figure out what is the real difference between Kafka Streaming vs Spark Streaming other than saying one can be used as part of Micro services (since Kafka streaming is just a library) and the other is a Standalone framework by itself.

If I can accomplish same job one way or other this is a sort of a puzzling question for me so it would be great to know what Spark streaming can do that Kafka Streaming cannot do efficiently or whatever ?

Thanks!




--
Regards,
Vaquar Khan
+1 -224-436-0783
Greater Chicago

Re: What is the real difference between Kafka streaming and Spark Streaming?

Posted by vincent gromakowski <vi...@gmail.com>.

I think Kafka Streams is good when the processing of each row is
independent of the others (row parsing, data cleaning...).
Spark is better when processing groups of rows (group by, ML, window
functions...).
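
The distinction between per-row and grouped processing can be sketched in plain Python. Neither Kafka Streams nor Spark is used here; `clean`, `windowed_counts`, and the sample rows are hypothetical and serve only to contrast the two shapes of work.

```python
from collections import Counter
from typing import Dict, List, Tuple

Row = Tuple[int, str]  # (timestamp in seconds, action)

rows: List[Row] = [(1, " Click "), (4, "view"), (12, "CLICK"), (15, "view"), (17, "view")]


def clean(row: Row) -> Row:
    """Per-row work: depends only on the row itself, so no state is needed."""
    ts, action = row
    return ts, action.strip().lower()


def windowed_counts(data: List[Row], window: int) -> Dict[int, Counter]:
    """Grouped work: rows must be collected per window before counting."""
    counts: Dict[int, Counter] = {}
    for ts, action in data:
        start = ts - ts % window  # start of the tumbling window holding ts
        counts.setdefault(start, Counter())[action] += 1
    return counts


cleaned = [clean(r) for r in rows]
counts = windowed_counts(cleaned, window=10)
```

The cleaning step can run on each row in isolation, while the windowed count has to keep state across rows that fall into the same 10-second window.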

On June 11, 2017 8:15 PM, "yohann jardin" <yo...@hotmail.com> wrote:

Hey,
Kafka can also do streaming on its own:
https://kafka.apache.org/documentation/streams
I don’t know much about it, unfortunately. I can only repeat what I heard at
conferences: one should give Kafka streaming a try when the whole pipeline
uses Kafka. I have no pros or cons to argue on this topic.

*Yohann Jardin*
On 6/11/2017 at 7:08 PM, vaquar khan wrote:

Hi Kant,

Kafka is a message broker, used through producers and consumers, and Spark
Streaming is used for real-time processing. Kafka and Spark Streaming work
together; they are not competitors.
Spark Streaming reads data from Kafka and processes it in micro-batches. In
simple terms, it collects data for some time, builds an RDD, and then
processes these micro-batches.

Please read the docs:
https://spark.apache.org/docs/latest/streaming-programming-guide.html

Spark Streaming is an extension of the core Spark API that enables
scalable, high-throughput, fault-tolerant stream processing of live data
streams. Data can be ingested from many sources like *Kafka, Flume,
Kinesis, or TCP sockets*, and can be processed using complex algorithms
expressed with high-level functions like map, reduce, join and window.
Finally, processed data can be pushed out to filesystems, databases, and
live dashboards. In fact, you can apply Spark’s machine learning
<https://spark.apache.org/docs/latest/ml-guide.html> and graph processing
<https://spark.apache.org/docs/latest/graphx-programming-guide.html> algorithms
on data streams.

Regards,

Vaquar khan

On Sun, Jun 11, 2017 at 3:12 AM, kant kodali <ka...@gmail.com> wrote:

> Hi All,
>
> I am trying hard to figure out what is the real difference between Kafka
> Streaming vs Spark Streaming other than saying one can be used as part of
> Micro services (since Kafka streaming is just a library) and the other is a
> Standalone framework by itself.
>
> If I can accomplish same job one way or other this is a sort of a puzzling
> question for me so it would be great to know what Spark streaming can do
> that Kafka Streaming cannot do efficiently or whatever ?
>
> Thanks!
>
>

-- 
Regards,
Vaquar Khan
+1 -224-436-0783
Greater Chicago

Re: What is the real difference between Kafka streaming and Spark Streaming?

Posted by yohann jardin <yo...@hotmail.com>.

Hey,

Kafka can also do streaming on its own: https://kafka.apache.org/documentation/streams
I don’t know much about it, unfortunately. I can only repeat what I heard at conferences: one should give Kafka streaming a try when the whole pipeline uses Kafka. I have no pros or cons to argue on this topic.

Yohann Jardin

On 6/11/2017 at 7:08 PM, vaquar khan wrote:

Hi Kant,

Kafka is a message broker, used through producers and consumers, and Spark Streaming is used for real-time processing. Kafka and Spark Streaming work together; they are not competitors.

Spark Streaming reads data from Kafka and processes it in micro-batches. In simple terms, it collects data for some time, builds an RDD, and then processes these micro-batches.

Please read the docs: https://spark.apache.org/docs/latest/streaming-programming-guide.html


Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. In fact, you can apply Spark’s machine learning <https://spark.apache.org/docs/latest/ml-guide.html> and graph processing <https://spark.apache.org/docs/latest/graphx-programming-guide.html> algorithms on data streams.


Regards,

Vaquar khan

On Sun, Jun 11, 2017 at 3:12 AM, kant kodali <ka...@gmail.com> wrote:
Hi All,

I am trying hard to figure out what is the real difference between Kafka Streaming vs Spark Streaming other than saying one can be used as part of Micro services (since Kafka streaming is just a library) and the other is a Standalone framework by itself.

If I can accomplish same job one way or other this is a sort of a puzzling question for me so it would be great to know what Spark streaming can do that Kafka Streaming cannot do efficiently or whatever ?

Thanks!




--
Regards,
Vaquar Khan
+1 -224-436-0783
Greater Chicago

Re: What is the real difference between Kafka streaming and Spark Streaming?

Posted by vaquar khan <va...@gmail.com>.

Hi Kant,

Kafka is a message broker, used through producers and consumers, and Spark
Streaming is used for real-time processing. Kafka and Spark Streaming work
together; they are not competitors.
Spark Streaming reads data from Kafka and processes it in micro-batches. In
simple terms, it collects data for some time, builds an RDD, and then
processes these micro-batches.
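
As a rough illustration of the micro-batching idea, here is a plain-Python sketch. It is not the Spark Streaming API: `micro_batches`, the one-second interval, and the sample events are assumptions made for the example, and a Python list stands in for the RDD built from each interval.

```python
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[float, str]  # (arrival time in seconds, payload)


def micro_batches(events: Iterable[Event], interval: float) -> Iterator[List[str]]:
    """Group time-ordered events into consecutive fixed-length intervals."""
    batch: List[str] = []
    boundary = interval  # end of the interval currently being filled
    for ts, payload in events:
        # Emit every interval that ended before this event arrived,
        # including empty ones.
        while ts >= boundary:
            yield batch
            batch = []
            boundary += interval
        batch.append(payload)
    yield batch  # flush the last, possibly partial, batch


events = [(0.2, "a"), (0.7, "b"), (1.1, "c"), (2.5, "d")]
batches = list(micro_batches(events, interval=1.0))
# batches == [["a", "b"], ["c"], ["d"]]
```

Each emitted list is then processed as a unit, so an event's result is available only after its interval closes. That delay is the latency cost of micro-batching compared with handling each event on arrival.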

Please read the docs:
https://spark.apache.org/docs/latest/streaming-programming-guide.html

Spark Streaming is an extension of the core Spark API that enables
scalable, high-throughput, fault-tolerant stream processing of live data
streams. Data can be ingested from many sources like *Kafka, Flume,
Kinesis, or TCP sockets*, and can be processed using complex algorithms
expressed with high-level functions like map, reduce, join and window.
Finally, processed data can be pushed out to filesystems, databases, and
live dashboards. In fact, you can apply Spark’s machine learning
<https://spark.apache.org/docs/latest/ml-guide.html> and graph processing
<https://spark.apache.org/docs/latest/graphx-programming-guide.html> algorithms
on data streams.

Regards,

Vaquar khan

On Sun, Jun 11, 2017 at 3:12 AM, kant kodali <ka...@gmail.com> wrote:

> Hi All,
>
> I am trying hard to figure out what is the real difference between Kafka
> Streaming vs Spark Streaming other than saying one can be used as part of
> Micro services (since Kafka streaming is just a library) and the other is a
> Standalone framework by itself.
>
> If I can accomplish same job one way or other this is a sort of a puzzling
> question for me so it would be great to know what Spark streaming can do
> that Kafka Streaming cannot do efficiently or whatever ?
>
> Thanks!
>
>

-- 
Regards,
Vaquar Khan
+1 -224-436-0783
Greater Chicago