Posted to user@spark.apache.org by Joris Billen <jo...@bigindustries.be> on 2022/11/02 08:53:23 UTC
should one ever make a spark streaming job in pyspark
Dear community,
I had a general question about the use of Scala vs PySpark for Spark streaming.
I believe Spark streaming works most efficiently when written in Scala, but that the same things can be implemented in PySpark. My questions:
1) Is it completely dumb to make a streaming job in PySpark?
2) What are the technical reasons that it is done best in Scala (and are they easy to understand)?
3) Any good links with numbers on the performance difference, under what circumstances, and an explanation?
4) Are there certain scenarios where the use of PySpark can be motivated (maybe when someone doesn't feel comfortable writing a job in Scala and the number of messages per minute isn't gigantic, so performance isn't that crucial)?
Thanks for any input!
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org
Re: Re: should one ever make a spark streaming job in pyspark
Posted by Lingzhe Sun <li...@hirain.com>.
In addition to that:
For now, some stateful operations in Structured Streaming don't have an equivalent Python API, e.g. flatMapGroupsWithState. However, Spark engineers are making this possible in an upcoming version. See more: https://www.databricks.com/blog/2022/10/18/python-arbitrary-stateful-processing-structured-streaming.html
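To make the concept concrete, here is a plain-Python sketch (no Spark required) of the pattern behind flatMapGroupsWithState and the upcoming Python equivalent: a user function sees each key's new rows per micro-batch together with a persistent per-key state, and emits updated results. All names below are illustrative, not the real Spark API.

```python
def update_count(state, new_rows):
    """Per-key update: `state` is the running count so far (or None on
    first sight of the key), `new_rows` the rows for this key in the
    current micro-batch. Returns (new state, emitted record)."""
    count = (state or 0) + len(new_rows)
    return count, {"n_events": count}

def run_batches(batches):
    """Drive update_count over a stream of micro-batches, keeping
    state per key, roughly the way the Spark engine would."""
    state, out = {}, []
    for batch in batches:               # each batch: {key: [rows]}
        for key, rows in batch.items():
            state[key], record = update_count(state.get(key), rows)
            out.append((key, record))
    return out

# Two micro-batches of events for users "a" and "b":
emitted = run_batches([{"a": [1, 2], "b": [3]}, {"a": [4]}])
# → [("a", {"n_events": 2}), ("b", {"n_events": 1}), ("a", {"n_events": 3})]
```

In real Structured Streaming the engine, not your code, owns the state store and batching; the user only supplies the per-key update function.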
Best Regards!
...........................................................................
Lingzhe Sun
Hirain Technology / APIC
From: Mich Talebzadeh <mi...@gmail.com>
Date: 2022-11-03 19:15
To: Joris Billen
CC: User
Subject: Re: should one ever make a spark streaming job in pyspark
Well, your mileage varies, so to speak.
- Spark itself is written in Scala. However, that does not imply you should stick with Scala.
- I have used both for Spark Streaming and Spark Structured Streaming; they both work fine.
- PySpark has become popular with the widespread use of Data Science projects.
- What matters normally is the skill set you already have in-house. The likelihood is that there are more Python developers than Scala developers, and the learning curve for Scala has to be taken into account.
- The idea of performance etc. is tangential.
- With regard to the Spark code itself, there should be little effort in converting from Scala to PySpark or vice versa.
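On the conversion point, the DataFrame API is close to line-for-line between the two languages. A hypothetical illustration (the column names `status` and `userId` are made up for the example; running it needs a SparkSession, which is why the import sits inside the function):

```python
def ok_event_counts(df):
    """Count "ok" events per user in a Spark DataFrame `df` with
    columns `status` and `userId`. The Scala version differs only
    in surface syntax:

        df.filter(col("status") === "ok")
          .groupBy("userId")
          .agg(count("*").alias("n_events"))
    """
    # Imported here so the module loads without pyspark installed;
    # pyspark is only needed when the function is actually called.
    from pyspark.sql.functions import col, count
    return (df.filter(col("status") == "ok")
              .groupBy("userId")
              .agg(count("*").alias("n_events")))
```

The main differences in practice are `==` vs `===` for column equality, lambda syntax, and typing; the method chains themselves usually port almost unchanged.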
HTH
view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.