Posted to user@spark.apache.org by Athanasios Kordelas <at...@gmail.com> on 2023/01/26 09:10:48 UTC

Question regarding Spark 3.X performance

Hi all,

I'm running some tests on spark streaming (not structured) for my PhD, and
I'm seeing an extreme improvement when using Spark/Kafka 3.3.1 versus
Spark/Kafka 2.4.8/Kafka 2.7.0.

My (scala) application code is as follows:

KafkaStream => foreachRDD => mapPartitions => repartition => groupBy
=> agg(expr("percentile(value, array(0.25, 0.5, 0.75))")) => take(2)
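For reference, the percentile aggregate above returns, per group, the requested percentiles of the collected values, interpolating linearly between adjacent sorted values. A plain-Scala sketch of that calculation (an illustration of the math only, not Spark's actual implementation):

```scala
// Simplified percentile with linear interpolation between neighbouring
// sorted values (illustrative; not Spark's implementation).
def percentile(values: Seq[Double], p: Double): Double = {
  val sorted = values.sorted
  val idx    = p * (sorted.length - 1) // fractional rank in the sorted data
  val lo     = idx.toInt
  val hi     = math.min(lo + 1, sorted.length - 1)
  val frac   = idx - lo
  sorted(lo) * (1 - frac) + sorted(hi) * frac
}

// e.g. the three quartiles of 1..4, as requested in the agg() above
val quartiles = Seq(0.25, 0.5, 0.75).map(p => percentile((1 to 4).map(_.toDouble), p))
```

Calling it with p = 0.25, 0.5 and 0.75 yields the three quartiles that the `agg(expr(...))` step computes per key.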

In short, a two-core executor could process 600,000 rows of key/value pairs
in 60 seconds with Spark 2.x, while now, with Spark 3.3.1, the same
processing (same code) can be achieved in 5-10 seconds.

@apache-spark, @spark-streaming, @spark-mllib, @spark-ml, is there a
significant optimization that could explain this improvement?

BR,
Athanasios Kordelas

Re: Question regarding Spark 3.X performance

Posted by Mich Talebzadeh <mi...@gmail.com>.
I assume that the batch interval and the code remain the same; you need to
verify this. Note that it sounds like you are processing 1,000 more records
with Spark 3.x than with Spark 2.x (601,000 vs 600,000).

OK, let us have a look by observing the metrics for each release.


   - *Processing Time*: The time it takes to compute a given batch for all
     its jobs, end to end. We see a major difference here: 6s to process
     601,000 records with Spark 3.x compared to 57s with Spark 2.x to
     process 600,000 records.

   - *Scheduling Delay*: The time taken by the Spark Streaming scheduler to
     submit the jobs of the batch. In all cases this is negligible compared
     to the processing time.

   - *Total Delay*: Scheduling Delay + Processing Time.
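To put those Processing Times in perspective, a back-of-the-envelope throughput comparison using the figures reported in this thread:

```scala
// Implied per-batch throughput from the reported Processing Times
// (figures taken from the thread: 601,000 rows in 6s vs 600,000 rows in 57s).
val spark3RowsPerSec = 601000.0 / 6.0   // Spark 3.x: ~100,000 rows/s
val spark2RowsPerSec = 600000.0 / 57.0  // Spark 2.x: ~10,500 rows/s
val speedup = spark3RowsPerSec / spark2RowsPerSec // roughly 9-10x
```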

So your major gain is in processing the rows, and you state that the code
has not changed.

I gather you created a DStream, something like the below:


val dstream = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](topics, kafkaParams))

dstream.cache()

The term dstream.foreachRDD
<https://stackoverflow.com/questions/36421619/whats-the-meaning-of-dstream-foreachrdd-function>
is an output operator in Spark Streaming. It allows one to access the
underlying RDDs of the DStream to execute actions that do something
practical with the data; for example, using foreachRDD we could write the
data to a database.


// Work on every stream
dstream.foreachRDD { someRDD =>
  if (!someRDD.isEmpty) { // data exists in the RDD
    // do something
  }
}

So my suggestion is that you manually measure timing within each stream and
ascertain the cause of the delay/improvement for each Spark version. HTH
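As a minimal sketch of such manual timing (plain Scala with no Spark dependency; the helper and label names are illustrative), one could wrap each stage of the foreachRDD body:

```scala
// Minimal timing helper: runs a block, prints the elapsed wall-clock time
// for the given label, and returns the block's result.
def timed[T](label: String)(block: => T): T = {
  val start     = System.nanoTime()
  val result    = block
  val elapsedMs = (System.nanoTime() - start) / 1e6
  println(f"$label took $elapsedMs%.1f ms")
  result
}

// Hypothetical use inside foreachRDD (names assumed, not from the thread):
// val stats = timed("percentile agg") {
//   df.groupBy("key").agg(expr("percentile(value, array(0.25, 0.5, 0.75))")).take(2)
// }
```

Comparing the per-stage numbers between the two versions should show whether the gain comes from the shuffle (repartition) or from the percentile aggregation itself.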



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.





On Fri, 27 Jan 2023 at 11:27, Athanasios Kordelas <
athanasioskordelas@gmail.com> wrote:

> Re-sending Spark 2 img:
> [image: image.png]
>
>
> --Thanasis
>
> On Fri, 27 Jan 2023 at 13:03, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>> OK, great. I can zoom into spark 3 but not spark 2!
>>
>>
>>
>>
>>
>>
>>
>> On Fri, 27 Jan 2023 at 10:58, Athanasios Kordelas <
>> athanasioskordelas@gmail.com> wrote:
>>
>>> Hi again Mich,
>>>
>>>
>>> I think all the information is already provided in the previously
>>> attached file. All the extra time is due to extra processing time, and this
>>> is why I'm wondering if indeed there is a very good optimization in spark
>>> streaming or percentiles which could explain this behavior.
>>> The batch interval of the streaming application (not Structured
>>> Streaming) is set to 60 seconds for this test.
>>>
>>> Spark 3:
>>>
>>> [image: image.png]
>>>
>>> Spark 2:
>>>
>>> [image: image.png]
>>>
>>>
>>> --Thanasis
>>>
>>>
>>> On Fri, 27 Jan 2023 at 12:03, Athanasios Kordelas <
>>> athanasioskordelas@gmail.com> wrote:
>>>
>>>> Hi Mich,
>>>>
>>>> I'll gather them and send them to you :)
>>>>
>>>> Many thanks,
>>>> Thanasis
>>>>
>>>> On Fri, 27 Jan 2023 at 11:40, Mich Talebzadeh <
>>>> mich.talebzadeh@gmail.com> wrote:
>>>>
>>>>>
>>>>> Hi Athanasios
>>>>>
>>>>>
>>>>> Thanks for the details. Since I believe this is Spark Streaming, the
>>>>> all-important indicator is the Processing Time, defined by the Spark GUI
>>>>> as the time taken to process all jobs of a batch versus the batch
>>>>> interval. The Scheduling Delay and the Total Delay are additional
>>>>> indicators of health. Do you have these stats for both versions?
>>>>>
>>>>>
>>>>> cheers
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, 27 Jan 2023 at 09:03, Athanasios Kordelas <
>>>>> athanasioskordelas@gmail.com> wrote:
>>>>>
>>>>>> Hi Mich,
>>>>>>
>>>>>> Thank you for your reply. For my benchmark test, I'm only using one
>>>>>> executor with two cores in both cases.
>>>>>> I had created a large image with multiple UI screenshots a few days
>>>>>> ago, so I'm attaching it (please zoom in).
>>>>>> You can see spark 3 on the left side versus spark 2 on the right.
>>>>>>
>>>>>> I can collect more info by triggering new runs if this would help,
>>>>>> but I'm not sure what the best way is to provide you with all the
>>>>>> metrics data, maybe from logs?
>>>>>>
>>>>>> --Thanasis
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, 26 Jan 2023 at 22:03, Mich Talebzadeh <
>>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>>
>>>>>>> You have given some stats, 5-10 sec vs 60 sec with set-up and
>>>>>>> systematics being the same for both tests?
>>>>>>>
>>>>>>> So let us assume we see with 3.3.1 an average time of ~10 sec versus
>>>>>>> 60 sec with the older Spark 2.x.
>>>>>>>
>>>>>>> That gives us (60 - 10) / 60 * 100 ≈ 83% gain.
>>>>>>>
>>>>>>> However, that would not tell us why 3.3.1 excels in detail. For
>>>>>>> that you need to look at the Spark GUI metrics.
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, 26 Jan 2023 at 16:51, Mich Talebzadeh <
>>>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>>>
>>>>>>>> Please qualify what you mean by *extreme improvements*.
>>>>>>>>
>>>>>>>> What metrics are you using?
>>>>>>>>
>>>>>>>> HTH
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Athanasios Kordelas
>>>>>> Staff SW Engineer
>>>>>> T: +30 6972053674 | Skype: athanasios.kordelas@outlook.com.gr
>>>>>> athanasioskordelas@gmail.com
>>>>>>
>>>>>>
>>>>
>

Re: Question regarding Spark 3.X performance

Posted by Athanasios Kordelas <at...@gmail.com>.
Re-sending Spark 2 img:
[image: image.png]


--Thanasis


Re: Question regarding Spark 3.X performance

Posted by Mich Talebzadeh <mi...@gmail.com>.
OK, great. I can zoom into spark 3 but not spark 2!




Re: Question regarding Spark 3.X performance

Posted by Athanasios Kordelas <at...@gmail.com>.
Hi again Mich,


I think all the information is already provided in the previously attached
file. All the extra time is due to extra processing time, and this is why
I'm wondering whether there is indeed a very good optimization in Spark
Streaming or in percentiles which could explain this behavior.
The batch interval of the streaming application (not Structured Streaming)
is set to 60 seconds for this test.

Spark 3:

[image: image.png]

Spark 2:

[image: image.png]


--Thanasis



Re: Question regarding Spark 3.X performance

Posted by Athanasios Kordelas <at...@gmail.com>.
Hi Mich,

I'll gather them and send them to you :)

Many thanks,
Thanasis


-- 
Athanasios Kordelas
Staff SW Engineer
T: +30 6972053674 | Skype: athanasios.kordelas@outlook.com.gr
athanasioskordelas@gmail.com

Re: Question regarding Spark 3.X performance

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi Athanasios


Thanks for the details. Since this is Spark Streaming, the all-important
indicator is the Processing Time, defined in the Spark GUI as the time taken
to process all jobs of a batch, measured against the batch interval. The
Scheduling Delay and the Total Delay are additional indicators of health. Do
you have these stats for both versions?
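
For reference, these per-batch figures can also be captured programmatically
rather than read off the GUI. A minimal Scala sketch, assuming a
StreamingContext named ssc (the listener class name is only illustrative):

```scala
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Logs the three health indicators after every completed batch.
class BatchStatsListener extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    println(
      s"batch=${info.batchTime} " +
      s"records=${info.numRecords} " +
      s"schedulingDelay=${info.schedulingDelay.getOrElse(-1L)} ms " +
      s"processingTime=${info.processingDelay.getOrElse(-1L)} ms " +
      s"totalDelay=${info.totalDelay.getOrElse(-1L)} ms")
  }
}

// Register it before starting the context:
// ssc.addStreamingListener(new BatchStatsListener())
```

Logging these per batch in both runs would give directly comparable numbers.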


cheers



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.





Re: Question regarding Spark 3.X performance

Posted by Athanasios Kordelas <at...@gmail.com>.
Hi Mich,

Thank you for your reply. For my benchmark test, I'm only using one
executor with two cores in both cases.
I had created a large image with multiple UI screenshots a few days ago, so
I'm attaching it (please zoom in).
You can see Spark 3 on the left side versus Spark 2 on the right.

I can collect more info by triggering new runs if that would help, but I'm
not sure what the best way is to provide you with all the metrics data,
maybe from the logs?

--Thanasis





Re: Question regarding Spark 3.X performance

Posted by Mich Talebzadeh <mi...@gmail.com>.
You have given some stats: 5-10 sec vs 60 sec, with the set-up and
systematics being the same for both tests?

So let us assume an average of 10 sec with 3.3.1 versus 60 sec with the
older Spark 2.x.

That gives us (60 - 10) * 100 / 60 ≈ 83% gain.

However, that would not tell us why 3.3.1 excels in detail. For that you
need to look at the Spark GUI metrics.

HTH








Re: Question regarding Spark 3.X performance

Posted by Mich Talebzadeh <mi...@gmail.com>.
Please qualify what you mean by *extreme improvements*.

What metrics are you using?

HTH







On Thu, 26 Jan 2023 at 13:06, Athanasios Kordelas <
athanasioskordelas@gmail.com> wrote:

> Hi all,
>
> I'm running some tests on Spark Streaming (not Structured Streaming) for my
> PhD, and I'm seeing an extreme improvement when using Spark/Kafka 3.3.1
> versus Spark 2.4.8/Kafka 2.7.0.
>
> My (scala) application code is as follows:
>
> *KafkaStream* => foreachRDD => mapPartitions => repartition => GroupBy =>
> *agg(expr("percentile(value, array(0.25, 0.5, 0.75))"))* => take(2)
>
> In short, a two-core executor could process 600,000 rows of
> key/value pairs in 60 seconds with Spark 2.x, while now, with Spark 3.3.1,
> the same processing (same code) can be achieved in 5-10 seconds.
>
> @apache-spark, @spark-streaming, @spark-mllib, @spark-ml, is there a
> significant optimization that could explain this improvement?
>
> BR,
> Athanasios Kordelas
>
>
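
For context, the pipeline described in the original message would look
roughly like the following in Scala. This is a minimal reconstruction, not
the author's actual code: the topic name, Kafka parameters, batch interval,
and partition count are all assumed for illustration.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

val spark = SparkSession.builder.appName("percentile-bench").getOrCreate()
import spark.implicits._

// Assumed 60-second batch interval, matching the Spark 2.x processing time.
val ssc = new StreamingContext(spark.sparkContext, Seconds(60))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",          // assumed
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "percentile-bench")

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("input"), kafkaParams))

// KafkaStream => foreachRDD => mapPartitions => repartition => groupBy => agg => take(2)
stream.foreachRDD { rdd =>
  val parsed = rdd.mapPartitions(_.map(r => (r.key, r.value.toDouble)))
  val quartiles = parsed.toDF("key", "value")
    .repartition(2)                                  // two cores in the benchmark
    .groupBy("key")
    .agg(expr("percentile(value, array(0.25, 0.5, 0.75))"))
  quartiles.take(2).foreach(println)
}

ssc.start()
ssc.awaitTermination()
```

With a sketch like this, any timing difference between the two versions
would show up in the per-batch Processing Time on the Streaming tab.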