Posted to user@spark.apache.org by Joris Billen <jo...@bigindustries.be> on 2022/11/02 08:53:23 UTC

should one ever make a spark streaming job in pyspark

Dear community, 
I have a general question about the use of Scala vs. PySpark for Spark streaming.
I believe Spark streaming works most efficiently when written in Scala, but that the same things can be implemented in PySpark. My questions:
1) Is it completely dumb to make a streaming job in PySpark?
2) What are the technical reasons that it is best done in Scala (and are they easy to understand)?
3) Are there any good links with numbers on the performance difference, the circumstances under which it matters, and an explanation?
4) Are there scenarios where the use of PySpark can be justified (for example, when someone doesn't feel comfortable writing a job in Scala and the messages per minute aren't so numerous that performance is crucial)?

Thanks for any input!
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Re: should one ever make a spark streaming job in pyspark

Posted by Lingzhe Sun <li...@hirain.com>.
In addition to that:

For now, some stateful operations in Structured Streaming have no equivalent Python API, e.g. flatMapGroupsWithState. However, the Spark engineers are making this possible in an upcoming version. See more: https://www.databricks.com/blog/2022/10/18/python-arbitrary-stateful-processing-structured-streaming.html
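
For readers landing here later, below is a minimal sketch of what that Python API looks like, assuming Spark 3.4+ where applyInPandasWithState (the API described in the linked blog post) is available. The socket source, column names, and schema strings are hypothetical stand-ins.

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stateful-sketch").getOrCreate()

# Hypothetical stream of one user id per line from a socket source.
events = (spark.readStream
          .format("socket")
          .option("host", "localhost")
          .option("port", 9999)
          .load()
          .selectExpr("value AS user"))

def count_events(key, pdf_iter, state):
    # Keep a running per-user event count in the group state.
    count = state.get[0] if state.exists else 0
    for pdf in pdf_iter:
        count += len(pdf)
    state.update((count,))
    yield pd.DataFrame({"user": [key[0]], "count": [count]})

counts = (events.groupBy("user")
          .applyInPandasWithState(
              count_events,
              outputStructType="user STRING, count LONG",
              stateStructType="count LONG",
              outputMode="update",
              timeoutConf="NoTimeout"))

query = counts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()

The Scala equivalent would use flatMapGroupsWithState directly; the shape of the job is the same.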



Best Regards!
...........................................................................
Lingzhe Sun 
Hirain Technology / APIC
 

Re: should one ever make a spark streaming job in pyspark

Posted by Mich Talebzadeh <mi...@gmail.com>.
Well, your mileage varies, so to speak.


   - Spark itself is written in Scala. However, that does not imply you should stick with Scala.
   - I have used both for Spark Streaming and Spark Structured Streaming; they both work fine.
   - PySpark has become popular with the widespread use of Data Science projects.
   - What matters normally is the skill set you already have in-house. The likelihood is that there are more Python developers than Scala developers, and the learning curve for Scala has to be taken into account.
   - The idea of performance is largely tangential: DataFrame operations compile to the same execution plan whichever language you write them in, so overhead appears mainly where Python UDFs force rows through serialization.
   - With regard to the Spark code itself, there should be little effort in converting from Scala to PySpark or vice versa; the sketch below illustrates how similar the two look.
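
As a rough illustration, here is a minimal sketch of a PySpark Structured Streaming word count; the socket source, host, and port are hypothetical stand-ins for a real source such as Kafka, and the Scala version of the same job differs mostly in syntax, not structure.

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-wordcount").getOrCreate()

# Hypothetical socket source; in practice this would be Kafka or similar.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Built-in functions such as split/explode run inside the JVM, so this
# pipeline pays no Python serialization cost on the hot path.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()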

HTH


View my LinkedIn profile: <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



